A data flow computer is one which achieves enormous concurrency of
instruction execution through a machine architecture that acts directly on a data dependency
graph of the program. To handle arrays and data structures effectively, a data flow computer
must have access to a memory system which can handle large numbers of concurrent
transactions. This thesis presents a design for such a memory. A "cache" mechanism is
presented for improving the performance of the system, and a mechanism is given for using
sequential-access devices such as shift registers as the memory medium. The memory system
design uses the "packet communication" concept, in which the components of the system
communicate only through the transmission of fixed size "packets" of data.
0.0 INTRODUCTION

A data flow computer is a machine whose architecture is radically different from
that of existing computers. It can perform computations simultaneously on many different
parts of a program. A typical data flow computer has many arithmetic processors, and can
utilize all of them simultaneously, each executing a different instruction.

To handle arrays and other data structures, a data flow computer must have a
data structure processing facility, and a memory that can likewise perform many
operations concurrently. Such a data structure memory is the subject of this thesis.

A data flow computer owes its great speed to its ability to perform many
operations at once, even though each individual operation is no faster than on a conventional
computer. The same is true of the memory. The memory to be presented here has a
retrieval delay just as great as that of conventional memories, since no new circuit technology will be
proposed. However, it has an enormous data transfer rate because of its ability to handle
concurrent transactions. This concurrency is made possible by an unusual type of interface
called packet communication.

Section 1 of this thesis is an overview of data flow computers and the type of
memory that such a computer requires for structure processing. Section 2 is a treatment of
packet communication systems, showing how their behavior is defined. In section 3 the basic
memory unit is described, along with a "cache" mechanism and an "interleaving" method to
improve its performance. In section 4 an implementation of the memory using shift registers
or magnetic disks is given, showing how the disadvantages of such devices can be
overcome through the use of packet communication. Section 5 examines some aspects of the
processing unit that uses the memory, and section 6 examines the "deadlock" problem and the
cost of overcoming it. Section 7 presents suggestions for future research.
1.0 DATA FLOW COMPUTERS

As the need increases for ever faster computers, one technique for improving
performance that has drawn considerable interest in the last few years is a radically new
design known as a data flow computer [6] [7] [11] [15]. A conventional computer has only
one locus of control, that is, one point in the program at any given instant at which
instructions are executed. The ability to execute more than one instruction at a time can improve
performance significantly, and some computers use instruction lookahead to achieve this [3]
[9]. However, the benefits of lookahead methods are limited, and such computers are
enormously complex. Other attempts to increase instruction concurrency include "array
processors" [16], but such machines are inflexible and extremely difficult to program.

A data flow computer achieves concurrency of execution by using a different
internal representation of the source program. Instead of representing the program as a list
of instructions to be executed in a particular order, the program is represented as a data flow
schema. A data flow schema is a directed graph whose nodes represent instructions and
whose arcs show the data dependence among instructions. The order of instruction execution
is determined solely by the data dependence: an instruction is executed when all of its data
sources have produced results and all of its destinations are ready to receive data. This
allows many instructions throughout the program to be executed simultaneously.

The data in a data flow program can be modeled by "tokens" that reside on the
arcs of the graph. Each arc may contain at most one token. The execution rule for most
instructions is as follows:

An instruction (other than a merge or gate) is ready for execution whenever all
of its input arcs contain tokens and all of its output arcs are empty. When an
instruction is executed, the tokens on the input arcs are absorbed. The
function denoted by the instruction is computed, using the values in the
absorbed tokens as input data. A token containing the function value is placed
on each output arc.
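The execution rule can be expressed as a short simulation. This is a minimal sketch, not part of the original design: the `Arc` and `Instruction` classes are hypothetical, an empty arc is modelled by `None` (so `None` cannot itself be a token value), and copying a token onto several arcs is modelled by sharing the same Python value.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Arc:
    """An arc of the schema; it holds at most one token (None means empty)."""
    token: Optional[object] = None


@dataclass
class Instruction:
    fn: Callable[..., object]            # the function the node denotes
    inputs: List[Arc]
    outputs: List[Arc] = field(default_factory=list)

    def ready(self) -> bool:
        # Enabled when every input arc holds a token and every output arc is empty.
        return (all(a.token is not None for a in self.inputs)
                and all(a.token is None for a in self.outputs))

    def fire(self) -> None:
        if not self.ready():
            raise RuntimeError("instruction is not ready for execution")
        args = [a.token for a in self.inputs]
        for a in self.inputs:            # absorb the input tokens
            a.token = None
        result = self.fn(*args)
        for a in self.outputs:           # a copy of the result goes on each output arc
            a.token = result


x, y, out = Arc(3), Arc(4), Arc()
add = Instruction(lambda a, b: a + b, [x, y], [out])
add.fire()
print(out.token)  # 7
```

After firing, both input arcs are empty, so the instruction is no longer enabled until new tokens arrive.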
There are a number of ways of handling decisions and iteration control.
Perhaps the simplest is the use of special instructions M, T, and F. These receive a boolean
value on one input (the "control" input) and use it to control the passage of data from another
input. Their execution rules are as follows:

The M (merge) has a control input and two data inputs labelled "T" and "F". To
be ready for execution, there must be a boolean token on the arc leading to its
control input. Furthermore, the arc leading to whichever of its T or F inputs
matches that boolean token must have a token, and all output arcs must be
empty. When it is executed, the control token and the data token at the input
indicated by the control token are absorbed. Copies of the token at the
selected data input are placed on each output arc. Input tokens are not
required at the non-selected data input, and if any are present they are not
absorbed.

The T (true gate) and F (false gate) instructions have a control input and a data
input. They are ready for execution whenever both input arcs contain tokens
and all output arcs are empty. When they are executed, the inputs are
absorbed. If the control input matches the name of the instruction, copies of
the data input are placed on the output arcs. If not, no tokens are placed on
the output arcs.
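The rules for M, T, and F differ from the ordinary rule mainly in which tokens are required and which are absorbed. The following is an illustrative sketch under the same assumptions as a plain token model: the function names `merge_ready`, `fire_merge`, and `fire_gate` are hypothetical, and `None` again marks an empty arc.

```python
from typing import List, Optional


class Arc:
    def __init__(self, token: Optional[object] = None) -> None:
        self.token = token  # None means the arc is empty


def merge_ready(ctl: Arc, t_in: Arc, f_in: Arc, outs: List[Arc]) -> bool:
    if ctl.token is None or any(o.token is not None for o in outs):
        return False
    selected = t_in if ctl.token else f_in
    return selected.token is not None  # only the selected input needs a token


def fire_merge(ctl: Arc, t_in: Arc, f_in: Arc, outs: List[Arc]) -> None:
    if not merge_ready(ctl, t_in, f_in, outs):
        raise RuntimeError("merge is not ready")
    selected = t_in if ctl.token else f_in
    value = selected.token
    ctl.token = None       # absorb the control token
    selected.token = None  # absorb the selected data token; the other is untouched
    for o in outs:
        o.token = value


def fire_gate(passes_on: bool, ctl: Arc, data: Arc, outs: List[Arc]) -> None:
    """T gate: passes_on=True; F gate: passes_on=False."""
    if ctl.token is None or data.token is None or any(o.token is not None for o in outs):
        raise RuntimeError("gate is not ready")
    value, matches = data.token, ctl.token == passes_on
    ctl.token = data.token = None  # both inputs are absorbed in either case
    if matches:
        for o in outs:
            o.token = value
```

Note that a gate whose control input does not match still consumes its data token; it simply produces nothing.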

Constants  can  be  generated  through  the  use  of  functions  of  no  arguments.  An 
instruction  to  perform  such  a function  has  no  input  arcs,  so,  in  accordance  with  the  execution 
rule,  it  places  tokens  on  its  output  arc  as  fast  as  they  are  removed. 
Here is an example of a data flow schema to compute the factorial function:

[Figure: data flow schema for the factorial function.]

Boolean inputs to M, T, and F instructions are drawn as open arrows. Tokens existing in the
initial configuration of the program are drawn as filled-in circles.

The behavior of a data flow schema under the execution rules has a very
important property: it is determinate. This means that the output of the program is
determined only by the input, and is independent of the timing of instruction executions. All
runs of such a program with the same data will yield the same results. Determinacy follows
from the facts that

(1) Each instruction produces a result which is a function only of the values of
its input tokens, that is, each node of the schema is determinate.

(2) The value of a token does not change in any way while it resides on an arc.

(3) The execution rules, and fact (2) above, qualify the schema as a valid
interconnection of autonomous communicating systems.
It is an established result that such an interconnection of determinate systems is determinate
[13] [14].
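Determinacy can be observed experimentally by firing enabled instructions in a random order and checking that the output never changes. This is a toy demonstration, not a proof: the schema computing (a + b) * (a - b) is hypothetical, and a seeded random choice among enabled instructions stands in for arbitrary execution timing.

```python
import random

# Hypothetical schema computing (a + b) * (a - b).  Each instruction is
# (function, input arc names, output arc names); the two identity
# instructions fan a and b out onto two arcs each.
SCHEMA = [
    (lambda v: v, ["a"], ["a1", "a2"]),
    (lambda v: v, ["b"], ["b1", "b2"]),
    (lambda x, y: x + y, ["a1", "b1"], ["s"]),
    (lambda x, y: x - y, ["a2", "b2"], ["d"]),
    (lambda x, y: x * y, ["s", "d"], ["out"]),
]


def run(a, b, seed):
    rng = random.Random(seed)
    arcs = {"a": a, "b": b}          # an arc holds a token iff its name is a key
    while True:
        enabled = [ins for ins in SCHEMA
                   if all(n in arcs for n in ins[1])
                   and not any(n in arcs for n in ins[2])]
        if not enabled:
            return arcs.get("out")
        fn, in_arcs, out_arcs = rng.choice(enabled)  # arbitrary timing
        value = fn(*[arcs.pop(n) for n in in_arcs])  # absorb the inputs
        for n in out_arcs:
            arcs[n] = value


print({run(7, 3, seed) for seed in range(50)})  # {40}
```

Every firing order yields the same result because each node is a function of its inputs and tokens never change while on an arc, which are facts (1) and (2) above.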

1.0.1  DATA  FLOW  COMPUTER  ARCHITECTURE 

The memory system and structure processor that are the subject of this thesis
are intended to be part of a computer of the type described by Dennis and Misunas [6] [7].
Such a computer is composed of units which use packet communication [8] for transfer of
data. The only means of data transmission among these units is the transmission of fixed size
messages called packets. There is no clock or synchronizing information.

The four main parts of the data flow computer are the instruction memory,
arbitration network, functional units, and distribution network. For structure processing, the
structure controller and structure memory are added.

[Figure: organization of the data flow computer, including the distribution network and the arbitration network.]
To execute a data flow program, its schema is encoded into the instruction
memory. Each cell of the instruction memory holds one instruction of the schema: its operation
code (an arithmetic operation, a structure operation, etc.) and the addresses of its destinations. The latter are the cells to
which outgoing arcs point. The instruction cells also have receiver registers to contain
incoming "tokens". When all necessary receiver registers become full, an instruction cell emits
an operation packet, consisting of its operation code, the data from the receiver registers, and
the destination addresses.

Any  given  program  has  a great  number  of  instruction  cells,  each  sending 
operation  packets  only  occasionally.  These  streams  of  packets  are  merged  by  the  arbitration 
network  into  a small  number  of  dense  streams.  The  packets  coming  out  of  the  arbitration 
network  are  sorted  according  to  operation  code  and  sent  to  the  appropriate  functional  units. 
In  the  case  of  structure  processing  instructions,  they  are  sent  to  the  structure  controller. 
The  functional  units  or  structure  controller  perform  the  indicated  operation  and  form,  for  each 
destination,  a result  packet  consisting  of  the  destination  address  and  a copy  of  the  actual 
result.  The  result  packets  go  to  the  distribution  network,  where  they  are  sorted  by  address 
and  sent  to  the  appropriate  receiver  register  of  the  appropriate  instruction  cell.  (The 
destination  address  includes  the  receiver  number.)  If  the  instruction  is  a structure  operation, 
the  structure  controller  may  send  numerous  command  packets  to  the  memory  and  receive 
result  packets  back  during  the  course  of  its  computation. 
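The packet flow just described, from instruction cell to functional unit to result packets, can be sketched as follows. This is a simplified illustration: the `(cell number, receiver number)` address encoding, the class names, and the small opcode table are all hypothetical, and the arbitration and distribution networks and acknowledgments are not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical destination address: (instruction cell number, receiver number).
Destination = Tuple[int, int]


@dataclass
class OperationPacket:
    opcode: str
    operands: List[object]
    destinations: List[Destination]


@dataclass
class ResultPacket:
    destination: Destination
    value: object


class InstructionCell:
    def __init__(self, opcode: str, n_receivers: int,
                 destinations: List[Destination]) -> None:
        self.opcode = opcode
        self.receivers: List[Optional[object]] = [None] * n_receivers
        self.destinations = destinations

    def receive(self, receiver: int, value: object) -> Optional[OperationPacket]:
        """Store a token; once every receiver register is full, emit a packet."""
        if self.receivers[receiver] is not None:
            raise RuntimeError("receiver register already full")
        self.receivers[receiver] = value
        if any(r is None for r in self.receivers):
            return None
        packet = OperationPacket(self.opcode, list(self.receivers),
                                 list(self.destinations))
        self.receivers = [None] * len(self.receivers)
        return packet


# Hypothetical opcode table for one functional unit.
OPS = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}


def functional_unit(packet: OperationPacket) -> List[ResultPacket]:
    value = OPS[packet.opcode](*packet.operands)
    return [ResultPacket(d, value) for d in packet.destinations]  # one per destination


cell = InstructionCell("ADD", 2, [(17, 0), (23, 1)])
first = cell.receive(0, 3)            # None: still waiting for the second operand
results = functional_unit(cell.receive(1, 4))
print(results)  # two result packets, each carrying the value 7
```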

The preceding description does not quite implement the execution rule: an
instruction cell should wait until its "output arcs", that is, the receivers of its destinations, are
empty before issuing an operation packet, but there is no way for an instruction cell to "see" its
destinations' receivers. The problem is remedied by using, where necessary, acknowledgment
tokens sent from a cell's destinations to the cell itself. The acknowledges are treated like
invisible arguments, except that they contain no data. When a cell is executed, it may send
result packets to some destinations and acknowledges to others. A cell is not ready to be
executed until it has received all necessary real arguments and all necessary acknowledges.
Acknowledges are placed in the program where necessary to ensure that, when a cell has
received all arguments and acknowledges, its destinations' receiver registers will be empty.
These acknowledges should not be confused with the packet acknowledges to be developed
later.
A constant need not be implemented as a separate node of the data flow
schema. It can simply be loaded into the receiver register of the instruction cell that uses it,
and marked in such a way that the instruction cell knows that that register is always full.

An  additional  part  of  the  data  flow  computer,  not  shown  in  the  preceding 
diagram,  is  the  host  computer.  This  is  a computer  of  conventional  design,  which  has  access  to 
the  memory  units  and  control  functions  of  the  data  flow  computer.  It  is  used  for  diagnostic 
testing  and  for  initial  loading  of  the  instruction  memory  and  structure  memory.  It  does  not 
participate  in  the  actual  data  flow  computation. 
1.1 DATA STRUCTURES


In order to handle arrays and data structures in a data flow computer, it is in
most cases necessary to allow single tokens to have entire structures as their values. (Some
programs which use arrays of fixed size, such as Fourier transforms and other signal
processing algorithms, can make do with arrays of instructions with one token on each arc.
However, this approach is impractical for very large arrays or for dynamic structures.) For
this reason, we propose a data structure facility that allows tokens to have structure values.
The simplest type of structure that permits full generality is the binary tree, which is
recursively defined: a binary tree is an elementary "object" from some set, or is a
concatenation of two binary trees. Such trees form the basis for the programming language
LISP [4] [13]. For definiteness, the structures used in a data flow computer will be assumed
to be binary trees.


The "elementary objects" are all data values other than structures that the
computer can handle, plus the special object nil. Elementary objects thus might include
integers, boolean values, reals, etc.

The principal operation on a data structure is selection. A simple selection
takes a structure and a single bit. If the structure is elementary and not nil, the result of the
selection is undefined. If the structure is nil, the result is nil. Otherwise, the structure is the
concatenation of two structures, and the result of the selection is the first or second of these
if the bit is zero or one respectively. A compound selection takes a structure and a string of
bits, and gives the result of applying simple selections repeatedly, using the bits in sequence.
The bit string is called the selector. Let S be the following structure:
[Figure: the structure S.]

SELECT[S, '1'] = 5 (a simple selection)

SELECT[S, '001'] = SELECT[SELECT[SELECT[S, '0'], '0'], '1'] = 4 (a compound selection)

The true "meaning" or "value" of a structure can be defined to be the set of
ordered pairs consisting of each selector that yields an elementary value other than nil, together with that value.
Thus the structure S denotes the set

{ <'000', 1>, <'001', 4>, <'011', 3.14>, <'1', 5> }

Nil  simply  denotes  a substructure  with  no  elementary  items  at  all. 

Using this definition of the meaning of a structure, there is a structure
corresponding to any finite set of ordered pairs of selectors and elementary values (excluding
nil) such that no selector in the set is an initial substring of another. The structure nil
denotes the empty set.

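$$
\mathrm{SELECT}[\mathit{struc}, \mathit{sel}] =
\begin{cases}
v & \text{if } \langle \mathit{sel}, v \rangle \in \mathit{struc} \\
\text{undefined} & \text{if } \langle s, v \rangle \in \mathit{struc} \text{ where } s \text{ is a proper initial substring of } \mathit{sel} \\
\{\, \langle s, v \rangle \mid \langle \mathit{sel} \cdot s, v \rangle \in \mathit{struc} \,\} & \text{otherwise}
\end{cases}
$$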
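The set-theoretic definition of SELECT translates almost directly into code. In this sketch a structure is modelled by its meaning, a Python dict from selector strings to elementary values, with nil as the empty dict; this representation and the function name `select` are illustrative assumptions, and `S` is the example structure above.

```python
from typing import Dict

# A structure is modelled by its meaning: a dict from selector bit strings
# to elementary values other than nil.  nil is the empty dict.
Structure = Dict[str, object]
NIL: Structure = {}


def select(struc: Structure, sel: str):
    if sel in struc:                                  # <sel, v> is in struc
        return struc[sel]
    if any(sel.startswith(s) for s in struc):         # s is a proper prefix of sel
        raise ValueError("undefined: selection passes through an elementary value")
    # Otherwise: the substructure { <s, v> | <sel + s, v> in struc }.
    return {s[len(sel):]: v for s, v in struc.items() if s.startswith(sel)}


S = {"000": 1, "001": 4, "011": 3.14, "1": 5}
print(select(S, "1"))                                              # 5
print(select(S, "001"), select(select(select(S, "0"), "0"), "1"))  # 4 4
```

Selecting below nil yields the empty dict, matching the rule that selection from nil gives nil.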
Structures can be built with the append operation. APPEND places a given object (structure
or elementary value) onto a given structure with a given selector, removing whatever
substructure previously existed there. In the set-theoretic model,

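$$
\mathrm{APPEND}[\mathit{struc}, \mathit{new\text{-}val}, \mathit{sel}] =
\begin{cases}
\bigl(\mathit{struc} - \{\langle s, v \rangle \mid \text{one of } \mathit{sel} \text{ or } s \text{ is an initial substring of the other}\}\bigr) \cup \{\langle \mathit{sel}, \mathit{new\text{-}val} \rangle\} & \text{if } \mathit{new\text{-}val} \text{ is elementary (not nil)} \\
\bigl(\mathit{struc} - \{\langle s, v \rangle \mid \text{one of } \mathit{sel} \text{ or } s \text{ is an initial substring of the other}\}\bigr) \cup \{\langle \mathit{sel} \cdot s, v \rangle \mid \langle s, v \rangle \in \mathit{new\text{-}val}\} & \text{if } \mathit{new\text{-}val} \text{ is a structure, including nil}
\end{cases}
$$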

Letting S be the structure defined previously, APPEND[S, 7, '01'] is

[Figure: the resulting structure, which denotes { <'000', 1>, <'001', 4>, <'01', 7>, <'1', 5> }.]

The substructure containing nil and 3.14 disappears.
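The APPEND definition can be sketched in the same dict-of-selectors model used for SELECT. Again this is an illustration under assumed names (`append`, `related`), not the structure controller's algorithm; it returns a new dict and leaves its argument untouched, which is the value semantics the data flow model requires.

```python
from typing import Dict

Structure = Dict[str, object]   # selector -> elementary value (nil excluded); nil is {}


def related(a: str, b: str) -> bool:
    """True if one of a, b is an initial substring of the other."""
    return a.startswith(b) or b.startswith(a)


def append(struc: Structure, new_val, sel: str) -> Structure:
    # struc minus every pair whose selector is prefix-related to sel
    kept = {s: v for s, v in struc.items() if not related(s, sel)}
    if isinstance(new_val, dict):            # a structure, including nil ({})
        kept.update({sel + s: v for s, v in new_val.items()})
    else:                                    # an elementary value other than nil
        kept[sel] = new_val
    return kept                              # struc itself is left unchanged


S = {"000": 1, "001": 4, "011": 3.14, "1": 5}
print(append(S, 7, "01"))   # {'000': 1, '001': 4, '1': 5, '01': 7}
print(append(S, {}, "0"))   # {'1': 5}: appending nil deletes the left subtree
```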

1.1.1  REPRESENTATION  IN  MEMORY 

Structures can be implemented on a data flow computer in the same way that
they are commonly implemented on ordinary computers: as linked lists of "cells" in a memory.
An elementary object is represented by the object itself. A concatenation is represented by
the address in memory of a cell containing the representations of the two substructures. In
either case, a structure is represented by a small amount of information. The huge amount of
information that constitutes the structure itself is inside the memory, and the representation
is merely a pointer to this. The operation of selection is quite simple. Cells are read from
memory and the appropriate halves of the data used, under control of the selection bits.
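A sketch of selection over the pointer representation follows. The cell layout, the `("elem", value)` / `("ptr", address)` tagging of each half, the use of `None` for nil, and the addresses 100 to 103 are all hypothetical; the point is that a structure-valued token carries only a pointer, and selection costs one memory read per selector bit.

```python
from typing import Dict, Tuple

Half = Tuple[str, object]              # ("elem", value) or ("ptr", address)
Memory = Dict[int, Tuple[Half, Half]]
NIL: Half = ("elem", None)             # nil modelled as the elementary value None


def select(memory: Memory, root: Half, selector: str) -> Half:
    """Trace the selector one bit at a time: '0' takes the left half, '1' the right."""
    current = root
    for bit in selector:
        kind, value = current
        if kind == "elem":
            if current == NIL:
                return NIL                 # selecting from nil yields nil
            raise ValueError("undefined: selection from an elementary value")
        current = memory[value][int(bit)]  # one memory read per selector bit
    return current


# The structure S, laid out at hypothetical addresses 100-103.
memory: Memory = {
    100: (("ptr", 101), ("elem", 5)),
    101: (("ptr", 102), ("ptr", 103)),
    102: (("elem", 1), ("elem", 4)),
    103: (NIL, ("elem", 3.14)),
}
S = ("ptr", 100)                       # the token's value is just this pointer
print(select(memory, S, "001"))        # ('elem', 4)
```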
1.1.2  SHARING 


Such an implementation leads to the possibility of a single structure in memory
being shared (or partly shared) by several parts of the computation. In a data flow computer,
two tokens might have the same pointer as their value. This is of course very desirable for
economical memory use, but it makes the APPEND operation difficult. The problem is that
modification of pointers inside the memory can change the value of structures other than the
intended one, if structures have parts in common. In many programming languages, this is
considered a reasonable and even desirable effect. For example, the LISP language has
instructions to modify existing structures. In a data flow computer, however, this cannot be
permitted for reasons of determinacy. In order for a data flow computer to be determinate,
the meaning (in the set-theoretic sense given previously) of a token bearing a structure value
must not change while that token resides on an arc. Since other instructions, including
APPENDs, can be executed while a token resides on an arc, APPEND must never change any
substructures that are shared with other structures.

In the proposed structure processing facility, each cell has a reference count
which makes it easy to tell which substructures are shared. Whenever the APPEND processor
is about to modify a cell that is shared with another structure, it makes a copy of the cell
and modifies the copy instead. For example, if S is a pointer to the following structure in
memory:

[Figure: the structure S in memory, with a reference count in each node.]

where the number in each node is the reference count, APPEND[S, 7, '01'] yields

[Figure: the result of APPEND[S, 7, '01'].]

where the node that originally had a reference count of two may not be modified, so a copy is made,
and its reference count is therefore reduced to one. The structure controller to be described
in the next section will perform these tasks.
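The copy rule can be sketched with explicit reference counts. In this illustrative model (the `Store` class, its simple address allocator, and the restriction to elementary appended values along an existing pointer path are all assumptions), a node with count one is modified in place; a shared node is copied, its own count drops by one, and the nodes it points to gain a reference from the copy. The example mirrors the one above, with hypothetical addresses and node 2 shared.

```python
from typing import Dict, List

# memory[address] = [left_half, right_half, reference_count];
# a half is ("ptr", address) or ("elem", value).
Memory = Dict[int, List]


class Store:
    def __init__(self, memory: Memory) -> None:
        self.mem = memory
        self.next_addr = max(memory, default=0) + 1   # hypothetical allocator

    def release(self, half) -> None:
        """Drop one reference; at zero, free the node and release its children."""
        if half[0] != "ptr":
            return
        node = self.mem[half[1]]
        node[2] -= 1
        if node[2] == 0:
            del self.mem[half[1]]
            self.release(node[0])
            self.release(node[1])

    def writable(self, addr: int) -> int:
        """Return addr if it is unshared, otherwise the address of a fresh copy."""
        node = self.mem[addr]
        if node[2] == 1:
            return addr
        copy = self.next_addr
        self.next_addr += 1
        self.mem[copy] = [node[0], node[1], 1]
        node[2] -= 1                      # this reference now points at the copy
        for half in node[:2]:             # the copy duplicates both pointers
            if half[0] == "ptr":
                self.mem[half[1]][2] += 1
        return copy

    def append(self, root: int, value, selector: str) -> int:
        """APPEND an elementary value along a non-empty selector whose
        intermediate halves are all pointers (a simplifying assumption)."""
        result = addr = self.writable(root)
        for i, bit in enumerate(selector):
            node, side = self.mem[addr], int(bit)
            if i == len(selector) - 1:
                self.release(node[side])  # the replaced substructure loses a reference
                node[side] = ("elem", value)
                break
            if node[side][0] != "ptr":
                raise ValueError("sketch assumes the selected path already exists")
            addr = self.writable(node[side][1])
            node[side] = ("ptr", addr)
        return result


store = Store({
    1: [("ptr", 2), ("elem", 5), 1],          # root of S, held only by S's token
    2: [("ptr", 3), ("ptr", 4), 2],           # shared with some other structure
    3: [("elem", 1), ("elem", 4), 1],
    4: [("elem", None), ("elem", 3.14), 1],   # nil and 3.14
})
new_root = store.append(1, 7, "01")           # node 2 is copied to address 5
```

After the call, node 2 is untouched apart from its count falling to one, so any other structure sharing it keeps its meaning.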


1.2 THE STRUCTURE CONTROLLER

In this section we will outline the behavior of a processing mechanism that uses
the structure memory to provide a structure facility for the data flow computer. The basic
behavior of the structure controller is that it receives operation packets from the arbitration
network and delivers result packets to the distribution network. It holds the state information
for structure operations in progress, and performs memory operations by sending packets to
the memory and receiving packets in return.


The purpose of this section is to show how the structure controller will use the
memory, rather than to give a detailed specification for the structure controller. Therefore, a
number of design decisions will be made arbitrarily. For the most part, the requirements of
the structure memory are independent of these decisions. For example, the memory design
would not change if ternary trees were used instead of binary ones.


Some  aspects  of  the  design  of  the  structure  controller  will  be  considered  in 
more  detail  in  section  5. 

1.2.1  DATA  FORMAT 

The memory space is divided into "words" or "cells", each of which holds one
node of a structure. Since the memory is used for the storage of binary trees, the words
representing nonterminal nodes contain two pointers to other nodes. The convention will be
made that all words of the memory will be divided into halves, called the left half and the right
half. Each half has an "elem" bit that indicates whether it contains an elementary item (terminal
node) or a pointer to another word in the memory. If the bit is 1, the half word contains an
elementary value. The interpretation of that half word is then the exclusive responsibility of
the rest of the computer, unless it is nil. The structure controller treats any elementary value
other than nil simply as a collection of bits. Any type information (integer, floating point
number, character, etc.) must be encoded into the half word along with the data.


The structure, represented graphically as follows,

[Figure: a structure drawn as a binary tree.]

might be realized by address 102 in the following memory configuration:

[Figure: the contents of memory locations 102 and 107.]

The bit at the left end of each half word is the "elem" bit.

(A different convention could be used, in which each elementary value takes an
entire word instead of half a word. The two conventions are equally powerful, and differ only
slightly in execution. The "half word" convention will be used for definiteness.)
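The half-word convention amounts to simple bit packing. The thesis does not fix a word size here, so this sketch assumes a hypothetical 18-bit half word (one "elem" bit at the left end plus 17 payload bits); the helper names are likewise illustrative.

```python
HALF_BITS = 18                        # hypothetical half-word width
PAYLOAD_BITS = HALF_BITS - 1          # bits remaining after the "elem" bit
ELEM_BIT = 1 << PAYLOAD_BITS          # the leftmost bit of a half word
PAYLOAD_MASK = ELEM_BIT - 1
HALF_MASK = (1 << HALF_BITS) - 1


def encode_half(is_elem: bool, payload: int) -> int:
    """Pack a pointer (is_elem=False) or an elementary value (is_elem=True)."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in a half word")
    return (ELEM_BIT if is_elem else 0) | payload


def decode_half(half: int):
    return bool(half & ELEM_BIT), half & PAYLOAD_MASK


def make_word(left: int, right: int) -> int:
    return (left << HALF_BITS) | right


def split_word(word: int):
    return word >> HALF_BITS, word & HALF_MASK


# A word whose left half points to address 107 and whose right half holds
# the elementary bit pattern 42.
word = make_word(encode_half(False, 107), encode_half(True, 42))
left, right = split_word(word)
print(decode_half(left), decode_half(right))   # (False, 107) (True, 42)
```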

1.2.2 MEMORY MANAGEMENT AND GARBAGE COLLECTION

All words of memory that are not part of a structure are kept in a collection of
free storage lists. (There are several such lists, rather than one, in order to maintain a high
rate of processing. This point will be discussed in section 5.0.5.) Whenever the structure
controller needs a word in order to create a node, it takes it from one of the lists. Whenever
a node is destroyed, that is, all pointers to it disappear, the word containing it is returned to a
free storage list.


Each node of a structure has a reference count, which is the number of
pointers to that node that exist, whether in other nodes or in the rest of the computer. (The
latter includes operands waiting in instruction cells and packets in transit through the
arbitration and distribution networks.) The structure controller increases or decreases the
reference count of each node as pointers to it are created and destroyed. When the
reference count is decreased to zero, the node disappears, so it is returned to a free storage
list. Whenever this happens, any pointers that the node contained disappear, and so the
reference counts of the nodes pointed to must be decreased.
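The bookkeeping described above, several free lists plus a cascading decrement when a count reaches zero, might look like this. It is a sketch under assumptions: the `RefCountedMemory` class is hypothetical, and distributing cells over the free lists by address is an arbitrary choice made here for illustration, not the scheme of section 5.0.5.

```python
from collections import deque


class RefCountedMemory:
    """Cells of two halves, per-cell reference counts, and several free lists.
    A half is ("ptr", address), ("elem", value), or None for an unused cell."""

    def __init__(self, size: int, n_free_lists: int = 4) -> None:
        self.cells = [[None, None] for _ in range(size)]
        self.counts = [0] * size
        self.free_lists = [deque() for _ in range(n_free_lists)]
        for addr in range(size):          # spread cells over the lists by address
            self.free_lists[addr % n_free_lists].append(addr)

    def allocate(self, left, right) -> int:
        """Take a cell from a free list; the caller's one reference gives count 1.
        References the caller held to any children move into the new cell."""
        for free in self.free_lists:
            if free:
                addr = free.popleft()
                self.cells[addr] = [left, right]
                self.counts[addr] = 1
                return addr
        raise MemoryError("no free cells")

    def incref(self, addr: int) -> None:
        self.counts[addr] += 1

    def decref(self, addr: int) -> None:
        pending = [addr]
        while pending:                    # iterative, so long chains cannot overflow the stack
            a = pending.pop()
            self.counts[a] -= 1
            if self.counts[a] == 0:       # node disappears: release what it points to
                pending.extend(h[1] for h in self.cells[a]
                               if h is not None and h[0] == "ptr")
                self.cells[a] = [None, None]
                self.free_lists[a % len(self.free_lists)].append(a)


m = RefCountedMemory(8)
leaf = m.allocate(("elem", 1), ("elem", 2))
root = m.allocate(("ptr", leaf), ("elem", 3))   # our reference to leaf moves into root
m.decref(root)                                  # both cells return to the free lists
```

Had `leaf` also been referenced elsewhere, its count would merely drop to one and it would survive.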

The choice of a reference count strategy for memory management instead of
the "mark and scan" method commonly used in LISP systems was made for three reasons:

(1) The mark and scan method requires a garbage collection operation which
must find every reference to every structure. Since references exist in
packets in transit, it would be necessary to stop the entire computation and
wait until all packets stop moving before a garbage collection commences.

(2) The reference count is needed anyway in order to implement the copying
rule efficiently. Whenever the structure controller needs to modify a node
as part of an APPEND operation, it may do so safely if the reference count
is one. If not, the node must be copied.

(3) The objection to the reference count method in many list processing
systems, that it is difficult to recover circular lists, does not apply here.
Because of the copy rule, circular lists are never created.

1.2.3  THE  STRUCTURE  OPERATIONS 

The structure controller to be proposed implements the following program level
operations:
SELECT(structure, selector): The selector is a bit string of definite length. The
structure is traced under control of the bits in the selector, starting with
the leftmost bit. A zero bit selects the left offspring and a one bit selects
the right. The item at the selected point in the structure is returned,
whether it is elementary or a substructure.

APPEND(structure, object, selector): Returns a structure similar to the given
one, but having the object at the place specified by the selector. Whatever
was at that place in the original structure is absent in the result. The
object may be elementary or a structure. Any part of the original structure
that is shared with other parts of the computation is not modified. The
controller copies part or all of the original structure as necessary to be
sure that this is the case.

The structure controller recognizes the special constant nil which, while
elementary, is also the structure with no selectors. Nil is used as a terminal node of a
structure to indicate that there are no objects beyond that point. Any part of a structure
may be deleted simply by using the APPEND operation to replace it with nil, and a structure
may be created by appending something to nil. It is assumed that the constant nil is explicitly
available to the programmer for these purposes. The controller optimizes all structures,
replacing with nil any substructure all of whose terminal nodes are nil.
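The nil optimization is a bottom-up rewrite of the tree. In this sketch, nil is modelled as Python `None` and a concatenation as a 2-tuple; both choices are assumptions made for illustration. A node collapses to nil exactly when both of its (already normalized) subtrees are nil.

```python
NIL = None   # nil modelled as Python None; a concatenation is a 2-tuple


def normalize(tree):
    """Replace with nil every substructure all of whose terminal nodes are nil."""
    if not isinstance(tree, tuple):
        return tree                        # elementary value (possibly nil)
    left, right = normalize(tree[0]), normalize(tree[1])
    if left is NIL and right is NIL:
        return NIL
    return (left, right)


print(normalize(((NIL, NIL), (1, (NIL, NIL)))))   # (None, (1, None))
```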

There are two more operations performed implicitly by the controller. If any
operation returning a structure value specifies more than one destination, the reference count
of the result must be appropriately increased. Also, if any operation discards a structure
value, the reference count must be decreased. It follows that conditional operations such
as the true and false actors, which discard their data input when the control input does not
match, must be executed by a structure controller if the objects being switched are structures.
1.2.4 THE MEMORY OPERATIONS

The  structure  controller  communicates  with  the  memory  by  sending  command 
packets  and  receiving  result  packets.  These  packets  are  given  names  describing  the 
operation  to  be  performed. 

To read a word of memory, a FET ("fetch") packet is sent, giving the address.
The memory returns a LOAD packet with the data. Between the FET and the corresponding
LOAD, many other packets might be sent and received. This is a consequence of the
parallelism of the data flow computer: just as with the other functional units, the rate at
which structure operations are performed can be increased by allowing many operations to
be in progress simultaneously. This concurrency is made possible by the use of packet
communication at the memory interface. The FET packet that begins an operation and the
LOAD packet that ends it are distinct events and might be separated by a great number of
other packet transmissions and receptions. Each LOAD packet is identified with the FET
packet that caused it by means of the "tag", to be described later.
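Because replies may return in any order, the controller must recognize each LOAD by its tag rather than by arrival order. The sketch below simulates this with a memory that answers a batch of FET packets in shuffled order. The packet classes, the shuffled batch model, and the choice to use destination-cell lists as tags (anticipating the tag field described later) are illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Fet:
    address: int
    tag: Any                 # opaque to the memory; returned unchanged


@dataclass
class Load:
    address: int
    data: Any
    refcount: int
    tag: Any


def memory_replies(words: Dict[int, Any], counts: Dict[int, int],
                   commands: List[Fet], seed: int) -> List[Load]:
    """Answer every FET, but in an arbitrary order, as a pipelined memory might."""
    replies = [Load(c.address, words[c.address], counts[c.address], c.tag)
               for c in commands]
    random.Random(seed).shuffle(replies)
    return replies


words = {10: ("a", "b"), 11: ("c", "d")}
counts = {10: 1, 11: 2}
# For a simple SELECT, the tag can just be the list of destination cells.
sent = [Fet(10, tag=["cell 5", "cell 9"]), Fet(11, tag=["cell 7"])]
delivered = {}
for load in memory_replies(words, counts, sent, seed=3):
    for dest in load.tag:    # the tag alone says where each result goes
        delivered[dest] = load.data
```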

Each LOAD packet contains the address of the word and its reference count, as
well as the data. The address is probably not used by the structure controller, but is included
as part of the specification of the memory module because it is needed by the cache
mechanism to be described in section 3.2. The structure controller uses the reference count
in order to tell when a node may be written on without being copied (if count = 1) and when
a node should be destroyed (if count = 0).

To increase or decrease the reference count of a word, FET+ or FET-
packets, respectively, are sent. These are similar to FET, except that the reference count is
first modified. The memory replies to them with LOAD+ or LOAD- packets, which are similar to
LOAD packets. In some cases the structure controller does not use the data in a LOAD+ or
LOAD- packet, but it costs essentially nothing for the memory to send it.

To write on a word of memory, the structure controller sends an UPD
("update") packet giving the address, data, and reference count. The reference count is
presumably one, but the specification of the memory module allows an arbitrary count to be
given. (In an actual implementation of a structure controller and memory, unnecessary fields
would be omitted where possible, so that the controller would not send a reference count in
UPD packets or receive an address in LOAD, LOAD+, or LOAD- packets.) The memory sends no
reply to an UPD packet.
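The behavior of a memory module on these four commands can be summarized in a few lines. This sketch is hypothetical in its details: `MemoryModule.handle` and the tuple layout of the replies are invented for illustration, and CLR/DONE and all timing are omitted. It follows the text in applying the count change before replying, so the controller sees the updated count.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class Word:
    data: Any = None
    refcount: int = 0


@dataclass
class MemoryModule:
    words: Dict[int, Word] = field(default_factory=dict)

    def handle(self, op: str, address: int, tag: Any = None,
               data: Any = None, refcount: int = 1) -> Optional[Tuple]:
        word = self.words.setdefault(address, Word())
        if op == "UPD":                       # write data and count; no reply
            word.data, word.refcount = data, refcount
            return None
        reply = {"FET": "LOAD", "FET+": "LOAD+", "FET-": "LOAD-"}[op]
        if op == "FET+":
            word.refcount += 1                # count is modified first...
        elif op == "FET-":
            word.refcount -= 1
        # ...then a LOAD-type reply: (kind, address, data, reference count, tag)
        return (reply, address, word.data, word.refcount, tag)


mem = MemoryModule()
mem.handle("UPD", 5, data=("x", "y"), refcount=1)   # no reply
print(mem.handle("FET-", 5, tag="t"))  # ('LOAD-', 5, ('x', 'y'), 0, 't'): node may be destroyed
```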


There is another command that the memory recognizes. The CLR packet waits
until all pending operations on the given word are complete, and then returns a DONE packet.
It is not used by the structure controller at all, but is required for operation of the cache.


1.2.5 THE TAG FIELD

Every FET, FET+, or FET- packet has a field called the "tag" field that
constitutes a reminder from the structure controller to itself, telling it what to do with the
result of the operation. The tag field of a command packet is returned unchanged in the
result packet.
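The command set described so far can be summarized as a small executable model. This is a sketch only: the names `Command`, `Result`, and `StructureMemory` are hypothetical, every word is assumed to read as data 0 with count 0 until written, and CLR is left out because it depends on pending operations that this instantaneous model never has. What it shows is that the tag is opaque to the memory and is copied verbatim into the reply.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class Command:
    op: str          # "FET", "FET+", "FET-" or "UPD"
    addr: int
    tag: Any = None  # opaque to the memory; echoed back unchanged
    data: int = 0    # meaningful only for UPD
    count: int = 0   # meaningful only for UPD


@dataclass(frozen=True)
class Result:
    op: str          # "LOAD", "LOAD+" or "LOAD-"
    addr: int
    data: int
    count: int
    tag: Any


class StructureMemory:
    """Hypothetical model of the memory module's FET/FET+/FET-/UPD commands."""

    # reply type and reference-count adjustment for each fetch command
    REPLIES = {"FET": ("LOAD", 0), "FET+": ("LOAD+", 1), "FET-": ("LOAD-", -1)}

    def __init__(self) -> None:
        # addr -> (data, reference count); unwritten words read as (0, 0)
        self.words: Dict[int, Tuple[int, int]] = {}

    def handle(self, cmd: Command) -> Optional[Result]:
        if cmd.op == "UPD":
            self.words[cmd.addr] = (cmd.data, cmd.count)
            return None  # the memory sends no reply to UPD
        reply_op, delta = self.REPLIES[cmd.op]
        data, count = self.words.get(cmd.addr, (0, 0))
        count += delta  # FET+ and FET- modify the count first
        self.words[cmd.addr] = (data, count)
        return Result(reply_op, cmd.addr, data, count, cmd.tag)


mem = StructureMemory()
mem.handle(Command("UPD", 5, data=42, count=1))
reply = mem.handle(Command("FET+", 5, tag=("cell", 17)))
# reply == Result("LOAD+", 5, 42, 2, ("cell", 17)): the tag comes back untouched
```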


Consider the case of a simple SELECT instruction. When the instruction cell
fires, an operation packet goes to the structure controller containing the operation code, the
structure, the selector, and the addresses of the instruction cells which are to receive the
result. There might typically be three such destination addresses, each about 20 bits long.
The structure controller can simply make them the tag field of the "fetch" command to the
memory, and then use them when they come back in the result packet. In the case of more
complicated structure operations, such as APPENDs with compound selectors, there is a large
amount of state information that must be remembered through the many memory transactions
that make up the structure operation. In addition to the destination addresses, there is the
datum to be appended, the structure to be ultimately returned, the remaining selector bits,
and a few pointers. The total amount of such data might typically be 200 bits or more.

There are two ways of handling this information. One method is to include all
of it in the tag field of commands to the memory, so the structure controller doesn't need to
store any information about the state of ongoing structure operations. When the result
packet comes back from the memory, the structure controller looks at the entire packet,
including the tag field, decides what to do next, and produces a new packet to send back to
the memory. This method (the "memoryless structure controller" method) is efficient, but it
requires an extremely wide data path for all memory transactions, and it gives rise to very
difficult problems of avoiding deadlocks.

A second method is to store all of the state information in the structure
controller. This requires that the controller have a memory with a capacity of 200 bits or
more for every structure operation that can be in progress at one time. In this case only the
address of the block of memory in which the state information is stored must be put in the tag
field. If 256 simultaneous structure operations are allowed, the tag field only needs to be 8
bits.
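The second method amounts to a table of in-flight operations indexed by tag. The sketch below assumes a hypothetical `TagTable` class that hands out `tag_bits`-bit tags (8 bits gives the 256 concurrent operations mentioned above) and stores an arbitrary Python object as each operation's state; the real controller would store a block of roughly 200 bits per slot.

```python
from typing import Any, Dict, List


class TagTable:
    """Hypothetical controller-side store for the state of in-flight structure operations."""

    def __init__(self, tag_bits: int = 8) -> None:
        self.free: List[int] = list(range(2 ** tag_bits))  # tags not currently in use
        self.state: Dict[int, Any] = {}                     # tag -> saved operation state

    def allocate(self, op_state: Any) -> int:
        """Start an operation; the returned tag goes into every memory command it issues."""
        if not self.free:
            raise RuntimeError("too many structure operations in progress")
        tag = self.free.pop()
        self.state[tag] = op_state
        return tag

    def lookup(self, tag: int) -> Any:
        """A result packet echoed `tag`: recover what the controller was doing."""
        return self.state[tag]

    def release(self, tag: int) -> Any:
        """The whole structure operation has finished: free its tag for reuse."""
        self.free.append(tag)
        return self.state.pop(tag)


table = TagTable()
tag = table.allocate({"op": "APPEND", "destinations": [0x1234, 0x5678]})
# ... every memory command of this APPEND carries `tag`, and each reply echoes it ...
state = table.lookup(tag)
table.release(tag)
```

The same tag stays allocated across all the memory transactions of one structure operation; only `release` returns it to the pool.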


In either case, commands to the memory contain a tag field. The memory
echoes the tag back to the controller in the result packet.

1.2.6  THE  DATA  AND  REFERENCE  COUNT  FIELDS 

The contents of each memory word consist of a data field and a reference
count field. The data field is further divided into two pointer fields, leaf-node indicator bits,
perhaps a bit to indicate that the cell is on the free storage list, and perhaps type indicator
fields for elementary values. All of these are significant only to the structure controller, and
are irrelevant to the memory. The memory can simply consider the data to be a homogeneous
field. In practice, it might be about 40 to 80 bits long.

From the memory's standpoint, the reference count is simply part of the data
associated with each word. In some transient cases it might become negative in some parts of
the memory system, although the structure controller will never see a negative reference
count. In a typical realization, the reference count field might be about 8 to 15 bits long.

Incoming and outgoing packets that read or write a word of memory have data
and reference count fields that correspond precisely to the fields in memory.
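As an illustration of how a word might be laid out, the sketch below packs a data field and a reference count into one integer. The widths (64 data bits, 12 count bits) are assumptions picked from the ranges quoted above, and the count is stored in two's complement only because the text notes that it may go transiently negative inside the memory system.

```python
from typing import Tuple

DATA_BITS = 64   # assumption; the text suggests roughly 40 to 80
COUNT_BITS = 12  # assumption; the text suggests roughly 8 to 15
COUNT_MASK = (1 << COUNT_BITS) - 1


def pack_word(data: int, count: int) -> int:
    """Pack the data field (high bits) and the reference count (low bits) into one word."""
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data field out of range")
    if not -(1 << (COUNT_BITS - 1)) <= count < (1 << (COUNT_BITS - 1)):
        raise ValueError("reference count out of range")
    # two's-complement encoding lets a transiently negative count be stored
    return (data << COUNT_BITS) | (count & COUNT_MASK)


def unpack_word(word: int) -> Tuple[int, int]:
    """Inverse of pack_word: return (data, count)."""
    raw = word & COUNT_MASK
    count = raw - (1 << COUNT_BITS) if raw >> (COUNT_BITS - 1) else raw
    return word >> COUNT_BITS, count
```

The memory itself never interprets the data bits; only the structure controller would split them further into pointers and indicator bits.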


2.0  SPECIFICATIONS  OF  PACKET  SYSTEMS 


In this section we will develop methods by which one may describe how a
hardware system using the packet communication principle is constructed, how such a system
behaves, and how one may prove that a system constructed in a certain way behaves in a
certain way. Examples will be given of simple systems that illustrate some of the important
points of the design method.

2.0.1  FUNCTIONAL  SPECIFICATIONS 

Because of the restricted way in which packet communication systems interact
with their environment, it is easy to describe how such a system might behave. Since the
only interaction is through packets, a system's behavior is completely known if it is known
what packets it will transmit in response to whatever packets are sent to it. One other piece
of information that might be available, but that we reject, is the time when a packet is
transmitted. It is impermissible for a system to be described as, for example, transmitting the
result of a computation between 1 and 1.5 microseconds after it receives the data. The only
requirement is that it eventually produce the result. (This is not to say that speed is
unimportant. Like any other computer, a data flow computer is designed with operating speed
in mind. The conditions for correct behavior are independent of speed, however. If any
component of a data flow computer is replaced with one that operates at a different speed,
the computer will continue to function correctly.)

Since a module of a packet communication system may retain internal state
information (though many useful modules do not), the "result" packets that it transmits may
depend not just on individual input packets, but on the entire history of input packets. All
packets that pass through a given input or output port have a definite order among
themselves. The ordered sequence of packets that have passed through a port from the time
that the system was started up until a given instant is the history of the port at that instant.
A history of a port will be written by listing the packets in parentheses, separated by
semicolons.

There is a partial order on histories: $X \le Y$ if $X$ is an initial subsequence of $Y$.

For example:

$$(1 ; 3 ; 4) \le (1 ; 3 ; 4 ; 7)$$

but $(1 ; 2 ; 4)$ and $(1 ; 3 ; 4)$ do not satisfy this relation in either order.

Since histories only grow longer as time progresses and symbols already in a
history never change, a history at a later instant is always greater than or equal to a history
at an earlier instant.

The length of port history $X$ is denoted $|X|$. The individual packets of $X$ are

$$X_1, X_2, \ldots, X_{|X|}.$$

There is no defined time order among packet arrivals on different ports, so it is
useless to represent them as a single sequence. Instead, a history array is used, which is a
collection of histories, one per port. The partial order on histories can be extended to arrays:
$A \ge B$ if each history of $A$ is greater than or equal to the corresponding history of $B$. Like
histories, history arrays increase as time progresses.
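Both orderings can be stated directly in code. In this sketch a history is any Python sequence and a history array is a dict from (hypothetical) port names to histories; a port absent from the dict is treated as having the empty history.

```python
from typing import Dict, Hashable, Sequence


def history_le(x: Sequence, y: Sequence) -> bool:
    """X <= Y iff X is an initial subsequence (prefix) of Y."""
    return len(x) <= len(y) and list(y[: len(x)]) == list(x)


def array_le(a: Dict[Hashable, Sequence], b: Dict[Hashable, Sequence]) -> bool:
    """A <= B iff every port history of A is <= the corresponding history of B."""
    ports = set(a) | set(b)
    return all(history_le(a.get(p, ()), b.get(p, ())) for p in ports)


# the examples from the text
print(history_le((1, 3, 4), (1, 3, 4, 7)))                                 # True
print(history_le((1, 2, 4), (1, 3, 4)), history_le((1, 3, 4), (1, 2, 4)))  # False False
```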

The description of how a system is expected to behave is quite simple. It is a
description, for every input history array, of what output history array the system will
eventually produce. "Eventually" means in finite time for finite histories. For infinite
histories, it means that, for any $K$, the first $K$ packets will be produced in finite time. This is
because a system which is expected to have an infinite output history cannot ever transmit its
entire output in finite time.

A description of the dependence of output history arrays on input arrays is
called a functional specification. It is a description of how a system is expected to behave.
The major problems in the field of packet communication systems are proving that a system
built in a certain way obeys a certain functional specification, and proving that the
interconnection of systems known to obey certain functional specifications obeys some other
functional specification.

If, for any input array, the functional specification states that there is only one
possible output array, the system is determinate (sometimes called functional, but that term
will not be used here). In that case there is a function, say $f$, mapping input arrays to output
arrays, such that, if input $X$ (and no more) is delivered to the system, $f(X)$ will eventually be
produced. If further input is then given, the input history is $Y$ with $Y \ge X$, and output history
$f(Y)$ will be produced. Since the system cannot retract any of its previous output, $f(Y) \ge f(X)$.
From this it is easy to see that $f$ is monotonic in that:

$$X \le Y \implies f(X) \le f(Y)$$
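Monotonicity can be checked mechanically on finite examples, although such a check can only refute it, never prove it. The sketch below tests every pair of prefixes of a given input history. `running_sum` is a hypothetical determinate system, and `reversed_output` is a function that no packet system could realize, because a later input would force it to change output it has already sent.

```python
from typing import Callable, Sequence, Tuple

History = Tuple[int, ...]


def is_prefix(x: Sequence, y: Sequence) -> bool:
    return len(x) <= len(y) and tuple(y[: len(x)]) == tuple(x)


def monotone_on_prefixes(f: Callable[[History], History], history: History) -> bool:
    """Check X <= Y  =>  f(X) <= f(Y) for every pair of prefixes X <= Y of `history`."""
    prefixes = [history[:k] for k in range(len(history) + 1)]
    return all(is_prefix(f(x), f(y))
               for i, x in enumerate(prefixes) for y in prefixes[i:])


def running_sum(xs: History) -> History:
    """Hypothetical determinate system: after each input it emits the total so far."""
    out, total = [], 0
    for v in xs:
        total += v
        out.append(total)
    return tuple(out)


def reversed_output(xs: History) -> History:
    """Not monotonic: later input changes the first packet of the output."""
    return tuple(reversed(xs))


print(monotone_on_prefixes(running_sum, (3, 1, 4, 1, 5)))  # True
print(monotone_on_prefixes(reversed_output, (1, 2)))        # False
```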

If there is more than one legal response to a given input array, the system is
nondeterminate. In that case a function is also used to define the functional specification, but
$f(X)$ is the set of all legal output history arrays. Functions defining the specifications of
nondeterminate systems also obey a sort of monotonicity property, which will be given later.

It is possible for an interconnection of nondeterminate systems to be
determinate. For example, a data flow computer is determinate even though its arbitration
network is not. An interconnection of determinate systems is always determinate, and its
function can be computed explicitly from the functions of the components [1].

2.0.2  DESCRIPTIVE  SPECIFICATIONS 

Since a major task of the system designer is to demonstrate that a system built
in a certain way obeys certain functional specifications, it is necessary to describe in a
reasonably formal way how a system is built. A wiring diagram is one formalism, but it is far
too rigid and implementation-dependent. A higher level method is needed. When a system is
assembled from components, all using the packet communication principle, it is of course easy
to describe the interconnection, telling what ports of the various systems are connected to
each other. For systems that cannot be so decomposed, the descriptive specification will be
given in terms of a program written in an extremely informal ALGOL-like language. This
language is a subset of the Architecture Description Language [10], which is under
development.


In the language we will use for giving descriptive specifications, packets will
look like data records with a title and one or more data fields, for example: "WRITE(3, 7)".
This format is purely cosmetic. In the actual hardware implementation, a packet is nothing but
a collection of bits. The fields are simply divisions of these bits into subsets that the sender
and receiver both agree upon. The titles are just encodings of another field.
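To make the point concrete, here is one possible bit layout for READ and WRITE packets. The field widths and the title encoding (WRITE = 0, READ = 1) are assumptions for illustration; any layout that the sender and receiver agree on would serve equally well.

```python
from typing import Tuple

# Assumed layout, most significant field first: title (1 bit) | addr (8 bits) | data (16 bits)
ADDR_BITS, DATA_BITS = 8, 16
TITLES = {"WRITE": 0, "READ": 1}
NAMES = {code: name for name, code in TITLES.items()}


def encode(title: str, addr: int, data: int = 0) -> int:
    """Turn a packet such as WRITE(3, 7) into a bare integer of bits."""
    if not (0 <= addr < (1 << ADDR_BITS) and 0 <= data < (1 << DATA_BITS)):
        raise ValueError("field out of range")
    return (TITLES[title] << (ADDR_BITS + DATA_BITS)) | (addr << DATA_BITS) | data


def decode(word: int) -> Tuple[str, int, int]:
    """Recover (title, addr, data) by splitting the bits at the agreed boundaries."""
    data = word & ((1 << DATA_BITS) - 1)
    addr = (word >> DATA_BITS) & ((1 << ADDR_BITS) - 1)
    return NAMES[word >> (ADDR_BITS + DATA_BITS)], addr, data


print(decode(encode("WRITE", 3, 7)))  # ('WRITE', 3, 7)
```

The title is simply the topmost field; nothing distinguishes it from the other fields except the agreement between sender and receiver.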

2.0.3  AN  EXAMPLE  OF  A DETERMINATE  MEMORY 

A functional and descriptive specification of a system called MEM will now be
given. MEM is a random access memory with an input port IN and an output port OUT. Two
types of packets may be delivered to it:

WRITE(addr, data)   writes the data into the given address
READ(addr)          fetches the data from the given address

The "addr" and "data" fields contain numbers that range over some finite and fixed spaces.
There is one output packet type:

RTR(addr, data)

(RTR stands for "retrieve".)

Every READ packet delivered to MEM results in transmission of a RTR packet
bearing the address and the current contents of that memory word. Every WRITE packet stores its
data in the memory and returns no result packet. The initial contents of each address of the
memory is zero.

For a given input history, the contents of the memory may be easily
determined. The contents of each word is simply the data field of the last WRITE packet
having that address, or zero if there is no such packet. The function realized by this
memory is $f_{MEM}$:

If $X$ = input history and $Y$ = output history, then $f_{MEM}(X) = Y$, where

$|Y|$ = the number of occurrences of READ(--) in $X$, and

$$
Y_i = \begin{cases}
\mathrm{RTR}(addr, data) & \text{if the } i\text{th READ(--) in } X \text{ is READ}(addr) \text{ and the last WRITE}(addr,\text{--}) \text{ in } X \text{ before that READ is WRITE}(addr, data) \\
\mathrm{RTR}(addr, 0) & \text{if the } i\text{th READ(--) in } X \text{ is READ}(addr) \text{ but there is no WRITE}(addr,\text{--}) \text{ before it}
\end{cases}
$$

Notation: WRITE(addr,--) means any WRITE packet having the specified address and
anything at all in the data field.

A functional specification of MEM simply consists of stating that MEM realizes
$f_{MEM}$, that is, that if the input history $X$ is presented to it, it will eventually transmit output
history $f_{MEM}(X)$.

This specification says nothing explicit about the states of MEM. This is a basic
property of the history function approach to system specification: even for a device whose
purpose is to have states, such as a memory, the specification does not mention the states. Of
course, the memory does have states, and the state is a function of the input history. Since
the input history records all of the information that has ever gone into the system, it contains
enough information to determine the state.

We now show how the system MEM may be built. The system uses a
random access memory, with a capacity of one word for each possible value of the "addr"
field of incoming packets. We choose some obvious correspondence between the values of
the "addr" field and word addresses. Each word can contain any of the possible values of the
"data" field of incoming WRITE packets. We choose some obvious correspondence here also.
The memory is initialized with all words containing zero.

The algorithm of the implementation of MEM is as follows: If a packet
WRITE(addr, data) is received, the data field is written into memory at the word address given
by the addr field. If a packet READ(addr) is received, the word at the appropriate address is
nondestructively read, and a packet RTR(addr, data) containing the data fetched from memory
is returned.
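Because $f_{MEM}$ is stated purely in terms of histories, it can be transcribed literally into an executable reference function that keeps no memory state at all: each RTR is computed by scanning the part of the input history that precedes the corresponding READ. This is a sketch, with packets represented (by assumption) as tuples such as `("WRITE", addr, data)` and `("READ", addr)`.

```python
from typing import List, Tuple

Packet = Tuple  # ("WRITE", addr, data) or ("READ", addr) in; ("RTR", addr, data) out


def f_mem(x: List[Packet]) -> List[Packet]:
    """Output history that f_MEM assigns to input history x, read off the definition."""
    y = []
    for i, packet in enumerate(x):
        if packet[0] != "READ":
            continue  # WRITEs contribute no output packet
        addr = packet[1]
        data = 0      # value used if no WRITE(addr, --) precedes this READ
        for earlier in x[:i]:
            if earlier[0] == "WRITE" and earlier[1] == addr:
                data = earlier[2]  # keep the last such WRITE
        y.append(("RTR", addr, data))
    return y


print(f_mem([("WRITE", 1, 11), ("READ", 1), ("READ", 2), ("WRITE", 1, 5), ("READ", 1)]))
# [('RTR', 1, 11), ('RTR', 2, 0), ('RTR', 1, 5)]
```

This mirrors the remark that the specification never mentions states: the "memory contents" exist here only implicitly, as a search over the history.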


This system may be implemented by the program which follows. "memory" is
an array which represents the actual memory.


process starts at A
input port IN
output port OUT
var command, addr, data
array memory init 0

| wait for input

A: until packet is available at IN do;

command := packet from port IN;

| analyze input packet

if command = READ(--) then
    let command = READ(addr);
    send RTR(addr, memory(addr)) at port OUT
else
    let command = WRITE(addr, data);
    memory(addr) := data;

goto A


Notes:

(1) The statements for receiving and transmitting packets are excessively primitive. Slightly
improved versions will be presented later.

(2) The expression RTR(addr, data) means "a RTR packet whose fields are filled with the
current values contained in addr and data".

(3) In conditionals, "=" tests a packet against a pattern, in which "--" matches any value.
"if packet = WRITE(3,--)" means "if packet is a WRITE packet whose first field is 3".

(4) The "let packet = pattern" statement is an assignment statement that sets the variables
appearing in the pattern to the values of the corresponding fields of the packet. "let
thing = WRITE(addr,--)" means "if the type of thing is not WRITE, it is an error; otherwise
set addr to the first field of thing and ignore the second field".

We now prove that this implementation satisfies the specification $f_{MEM}$. First, we need to
show that the memory state equals the system state (as defined by the input history) under
the following correspondence:

For all $X$, the contents of memory address $X$ for a given input history is

zero if the input history contains no packets WRITE($X$,--),

$Y$ if the history does contain such packets, and the last is WRITE($X$,$Y$).

Proof by induction on the length of the history at port IN. For length zero, all cells contain
zero by initialization, and the history contains no WRITE packets at all. Otherwise, assume the
correspondence holds for any history of length $K$ and prove it for $K+1$.

If $IN_{K+1}$ = READ(--), nothing was written into memory between receipt of $IN_K$
and $IN_{K+1}$, so the memory state did not change. The WRITE(--,--) packets in the history did
not change either.

If $IN_{K+1}$ = WRITE(addr, data), no memory cell other than addr changed, and the
WRITE($X$,--) packets in the history did not change for $X \ne$ addr. The contents of memory cell
addr is now data, and the last WRITE(addr,--) in the history is now obviously WRITE(addr,
data).


Next, we prove correctness of the implementation. If the input history is $X$, we
will show that $f_{MEM}(X)$ will appear at the output. This proof is also by induction. If $|X| = 0$,
$f_{MEM}(X)$ is the empty history, since $X$ contains no READ packets, and the implementation
specifies no output except in response to input. Now suppose $X' = x_1 x_2 \ldots x_{N+1}$. Let
$X = x_1 x_2 \ldots x_N$. By induction, $f_{MEM}(X)$ appeared at the output when $X$ was the input
history. When $x_{N+1}$ arrived, the system transmitted no output if $x_{N+1}$ was a WRITE, and
transmitted RTR(addr, memory(addr)) if $x_{N+1}$ was READ(addr).
Therefore the response to $X'$ is

$$
f_{MEM}(X) \text{ concatenated with } \begin{cases}
(\,) & \text{if } x_{N+1} = \mathrm{WRITE}(\text{--},\text{--}) \\
\mathrm{RTR}(addr, memory(addr)) & \text{if } x_{N+1} = \mathrm{READ}(addr)
\end{cases}
$$

where the memory state is that left by $X$.

Now $|f_{MEM}(X')| = |f_{MEM}(X)| + 1$ if $x_{N+1}$ is READ(--), and $|f_{MEM}(X')| = |f_{MEM}(X)|$ if it is
a WRITE; in both cases this is the length of the response to $X'$.

Also, if $x_{N+1}$ = WRITE(--,--), $f_{MEM}(X') = f_{MEM}(X)$, and if $x_{N+1}$ = READ(addr), $f_{MEM}(X')$
= $f_{MEM}(X)$ concatenated with RTR(addr, $z$), where $z$ = the data field of the last WRITE(addr,--)
packet, or zero if there is none. By the correspondence just proved, this is just the contents of
memory word addr.

The response to $X'$ is therefore $f_{MEM}(X')$.
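The induction above is the actual argument; a randomized test is a cheap complement that catches transcription mistakes but proves nothing. The sketch compares a literal reading of $f_{MEM}$ with the program's algorithm on every prefix of many random input histories, mirroring the induction on $|X|$. The generator, its probabilities, and the four-address space are arbitrary choices.

```python
import random
from typing import List, Tuple


def spec(x: List[Tuple]) -> List[Tuple]:
    # f_MEM: one RTR per READ, carrying the data of the last earlier WRITE to that address, else 0
    out = []
    for i, p in enumerate(x):
        if p[0] == "READ":
            writes = [q[2] for q in x[:i] if q[0] == "WRITE" and q[1] == p[1]]
            out.append(("RTR", p[1], writes[-1] if writes else 0))
    return out


def implementation(x: List[Tuple]) -> List[Tuple]:
    # the MEM program: a memory initialized to zero, updated one packet at a time
    memory, out = {}, []
    for p in x:
        if p[0] == "READ":
            out.append(("RTR", p[1], memory.get(p[1], 0)))
        else:
            memory[p[1]] = p[2]
    return out


def random_history(rng: random.Random, n: int) -> List[Tuple]:
    return [("READ", rng.randrange(4)) if rng.random() < 0.5
            else ("WRITE", rng.randrange(4), rng.randrange(100)) for _ in range(n)]


def check(trials: int = 200, seed: int = 1) -> bool:
    rng = random.Random(seed)
    for _ in range(trials):
        x = random_history(rng, rng.randrange(20))
        # compare on every prefix of x, as in the induction on |X|
        if any(spec(x[:k]) != implementation(x[:k]) for k in range(len(x) + 1)):
            return False
    return True


print(check())  # True
```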

This system has a few simplifying properties that a general system of the sort
to be used in the packet memory system can't have:

1) It is determinate.

2) Its behavior is defined for all possible input histories, that is, there are no illegal inputs.

3) It operates infinitely fast, that is, it is impossible for input commands to come too fast for it
to handle. (Note that the above proof says "when $x_{N+1}$ arrived, the system transmitted ...".)

2.1 NONDETERMINACY

Nondeterminate systems can take a wide variety of forms, and the problem of
formalizing the behavior of all nondeterminate systems is far too complex to be treated in this
thesis. Only the types of nondeterminacy that arise in connection with the structure facility
for the data flow machine will be treated.

The principal type of nondeterminacy that will arise in packet memory systems
is the removal of the requirement that the RTR packets be returned in the same order as the
READ packets that gave rise to them. For example, the input history

WRITE(1,11) ; WRITE(2,22) ; READ(1) ; READ(2)   could result in

RTR(1,11) ; RTR(2,22)   or in   RTR(2,22) ; RTR(1,11)

The system MEM is too simple to display this sort of nondeterminacy. For example, MEM
would return RTR(1,11) as soon as it received the first READ packet; it would not yet "know"
that it was about to receive a second READ packet which would give it the option of
producing its output packets in either of two orders. Later, we will exhibit implementations of
systems which can meaningfully take advantage of this nondeterminacy. For now, we will just
have to accept that such implementations (that is, descriptive specifications) exist, and
examine the form that the functional specification for such a system might take.

2.1.1  FUNCTIONAL  SPECIFICATIONS  OF  NONDETERMINATE  SYSTEMS 

A nondeterminate system can give any of several legal output histories in
response to a given input history. The "function" defining the system's behavior is therefore
multiple valued. One way to handle this situation is to treat the behavior of a system as being
defined by a relation instead of a function. The method to be used here, which is completely
equivalent, is to use functions whose values are sets of output histories. For example, in the
system NDMEM that we are developing,

$$
f_{NDMEM}(\mathrm{WRITE}(1,11) ; \mathrm{WRITE}(2,22) ; \mathrm{READ}(1) ; \mathrm{READ}(2)) =
\{ (\mathrm{RTR}(1,11) ; \mathrm{RTR}(2,22)),\ (\mathrm{RTR}(2,22) ; \mathrm{RTR}(1,11)) \}
$$
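For tiny inputs the set $f_{NDMEM}(X)$ can be enumerated by brute force: take the determinate output of MEM and keep every reordering that preserves the order of the RTR packets within each address. This sketch assumes tuple-encoded packets and is exponential in the number of READs, so it is only meant for examples like the one above.

```python
from itertools import permutations
from typing import List, Set, Tuple


def f_mem(x: List[Tuple]) -> List[Tuple]:
    """Determinate MEM output: one RTR per READ, in READ order."""
    memory, out = {}, []
    for p in x:
        if p[0] == "WRITE":
            memory[p[1]] = p[2]
        else:
            out.append(("RTR", p[1], memory.get(p[1], 0)))
    return out


def f_ndmem(x: List[Tuple]) -> Set[Tuple[Tuple, ...]]:
    """All reorderings of MEM's output that keep each address's RTRs in order."""
    base = f_mem(x)
    addrs = {r[1] for r in base}

    def per_address(seq, a):
        return [r for r in seq if r[1] == a]

    return {perm for perm in permutations(base)
            if all(per_address(perm, a) == per_address(base, a) for a in addrs)}


x = [("WRITE", 1, 11), ("WRITE", 2, 22), ("READ", 1), ("READ", 2)]
for y in sorted(f_ndmem(x)):
    print(y)
# (('RTR', 1, 11), ('RTR', 2, 22))
# (('RTR', 2, 22), ('RTR', 1, 11))
```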

The situation may arise that $f(X)$ is empty for some $X$. This means that $X$ is not a
valid input history, and the behavior of the system is undefined. This is different from the
situation in which an illegal input gives rise to a well-defined "error" response (packet) from
the system. An "error" packet is certainly more desirable than saying the system behavior is
undefined, but some situations, such as receiving acknowledgments for packets that were not
sent, are so pathological they must simply be assumed not to occur. Furthermore, at some
levels of detail in the description of a system, it is convenient to ignore error conditions if one
can prove that they won't occur when the system is functioning properly.

A functional description of a nondeterminate system is therefore a definition of
a function which maps input histories into sets of output histories. It is usually most
convenient to describe it as a predicate defining which histories are in $f(X)$ for a given $X$, and
that predicate is often the logical AND of a number of other predicates, so the functional
description looks like:

$Y$ is in $f(X)$ if

$P_1(X,Y)$ and
$P_2(X,Y)$, etc.

The rule for realization of a function is as follows: A system realizes $f$ if, given input history
$X$ with $f(X)$ nonempty, it will eventually produce some output history in $f(X)$.

The multiple-valued functions realized by nondeterminate systems must obey a
monotonicity property as follows:



NONDETERMINATE MONOTONICITY (ND-MONOTONICITY)

If $Q$ and $P$ are input histories and $Q \ge P$, then for
any output history $X$ in $f(P)$, if $f(Q)$ is nonempty there
is a history $Y$ in $f(Q)$ with $Y \ge X$.

Roughly speaking, this means that receipt of a legal input symbol will never
make the system unable to proceed legally. The purpose of the qualification "if $f(Q)$ is
nonempty" is to allow for the possibility that an illegal input packet might make the system
unable to proceed.

We can now give the functional specification for the nondeterminate memory
NDMEM, which can arbitrarily mix RTR packets for different addresses.


If $X$ = input history and $Y$ = output history,

$Y$ is in $f_{NDMEM}(X)$ if

(1) $Y$ consists only of packets RTR(--,--), and

(2) For all addr, the number of READ(addr)'s in $X$ = the
    number of RTR(addr,--)'s in $Y$, and

(3) For all addr and $K$, the $K$th RTR(addr,--) in $Y$, if it exists, is RTR(addr, val),
    where the last WRITE(addr,--) in $X$ before the $K$th READ(addr) in $X$
    is WRITE(addr, val) if such a WRITE(addr,--) exists, or val = 0
    if no WRITE(addr,--) exists before the $K$th READ(addr) in $X$.
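Conditions (1) to (3) translate directly into a membership test, which is exactly the predicate form of functional description described earlier. A sketch, again assuming tuple-encoded packets; it checks a complete output history, so condition (2) is an equality of per-address counts.

```python
from collections import Counter
from typing import List, Tuple


def in_f_ndmem(x: List[Tuple], y: List[Tuple]) -> bool:
    """Is the output history y in f_NDMEM(x)?  Conditions (1)-(3) of the specification."""
    # (1) Y consists only of RTR packets
    if any(p[0] != "RTR" for p in y):
        return False
    # (2) for every addr, #READ(addr) in X == #RTR(addr, --) in Y
    if Counter(p[1] for p in x if p[0] == "READ") != Counter(p[1] for p in y):
        return False
    # (3) the Kth RTR(addr, --) carries the value current at the Kth READ(addr)
    expected, current = {}, {}
    for p in x:
        if p[0] == "WRITE":
            current[p[1]] = p[2]
        elif p[0] == "READ":
            expected.setdefault(p[1], []).append(current.get(p[1], 0))
    seen = Counter()
    for _, addr, val in y:
        if expected[addr][seen[addr]] != val:
            return False
        seen[addr] += 1
    return True
```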


The system NDMEM has the property that the data returned in a RTR packet is
the data in the memory (that is, the data in the most recent WRITE command addressing that
cell) at the instant of the READ command corresponding to the RTR. At the instant the RTR
packet is sent out, another WRITE command might have already been received, but that WRITE
will have no effect on this RTR packet.

Example

Input:   WRITE(A,1)   READ(A)   WRITE(A,2)   READ(A)
Output:                                     RTR(A,1)   RTR(A,2)
         -------------------------------------------------------> time

At the instant the first RTR packet was returned, a WRITE command changing
the data from 1 to 2 had already been given, but the function requires that the value
1 be returned.

Here is a rough outline of an implementation of a system that realizes $f_{NDMEM}$:

SYSTEM #1 (realizing $f_{NDMEM}$)

(1) When a WRITE command comes in, write on the word of memory instantly.

(2) When a READ command comes in, fetch the word from memory instantly,
form a RTR message, and put it into a buffer or queue.

(3) Take messages out of the buffer and return them as output packets at any
time and in any order, subject to the restrictions that:

    (a) every packet in the buffer is eventually removed,

    (b) whenever a packet is removed, it must be the oldest in
    the buffer among those with its word address (that is,
    the buffer is first-in-first-out (FIFO) with respect to
    each address).
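System #1 can be simulated with one FIFO queue per address, drained in a random interleaving. This is an illustrative sketch: the random scheduler (with a hypothetical `seed` parameter) stands in for "at any time and in any order", and draining whatever remains after the last input stands in for restriction (a).

```python
import random
from collections import deque
from typing import Deque, Dict, List, Tuple


def system1(inputs: List[Tuple], seed: int = 0) -> List[Tuple]:
    """Simulate System #1 with a randomly chosen (but legal) output schedule."""
    rng = random.Random(seed)
    memory: Dict = {}
    buffer: Dict[object, Deque[Tuple]] = {}  # one FIFO per word address: restriction (b)
    out: List[Tuple] = []

    def emit_one() -> None:
        # pick any address with a waiting message, then send its oldest message
        addr = rng.choice([a for a, q in buffer.items() if q])
        out.append(buffer[addr].popleft())

    for p in inputs:
        if p[0] == "WRITE":
            memory[p[1]] = p[2]  # step (1): write instantly
        else:
            # step (2): read instantly and queue the RTR message
            rtr = ("RTR", p[1], memory.get(p[1], 0))
            buffer.setdefault(p[1], deque()).append(rtr)
        # step (3): at arbitrary moments, send some buffered messages
        while any(buffer.values()) and rng.random() < 0.5:
            emit_one()
    while any(buffer.values()):  # restriction (a): everything is eventually sent
        emit_one()
    return out
```

Whatever the seed, the timing example always yields RTR(A,1) followed by RTR(A,2), because both RTRs share one per-address queue and the memory was read at the instant of each READ.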


The implementation given above still requires that operations on the memory be
instantaneous, so it is not very useful because it doesn't take advantage of the delay between
a READ packet and the RTR packet that results. The data in the RTR packet must be the
contents of the memory word at the instant the READ/RTR interval begins. We would like the
system to be able to use the value of the memory word at any instant during the READ/RTR
interval. Here is an example of a system that takes such liberty:

SYSTEM #2 (purported realization of $f_{NDMEM}$)

(1) When a WRITE command comes in, write the word of memory instantly.

(2) When a READ command comes in, put the message READ(addr) in the
Pending Read Buffer (PRB).

(3) Take messages off the PRB at any time, subject to
the same restrictions as before, namely that every
message is eventually removed and the buffer is FIFO on
each address. When the message READ(addr) is taken from the
Pending Read Buffer, fetch the data from memory and form
a message RTR(addr, data). Send the latter to the
Finished Read Buffer (FRB).

(4) Take messages off the FRB at any time and in any order,
subject to the same restrictions as before, form a RTR
packet, and send it as output of the system.
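The failure of System #2 is easy to exhibit. In the sketch below, a hypothetical `lazy` flag chooses between two schedules that System #2 permits, and the input is the one from the timing example. Both schedules obey every rule of System #2, but the lazy one reads memory only after the second WRITE has landed.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def system2(inputs: List[Tuple], lazy: bool) -> List[Tuple]:
    """Simulate System #2 under one of two schedules it permits.

    lazy=False: each READ is taken off the PRB as soon as it arrives.
    lazy=True:  READs stay on the PRB until the whole input has been received.
    """
    memory: Dict = {}
    prb: Deque = deque()  # Pending Read Buffer (one FIFO is also FIFO per address)
    frb: Deque = deque()  # Finished Read Buffer

    def service_prb() -> None:
        while prb:
            addr = prb.popleft()
            frb.append(("RTR", addr, memory.get(addr, 0)))  # memory is read only now

    for p in inputs:
        if p[0] == "WRITE":
            memory[p[1]] = p[2]  # step (1): writes still happen instantly
        else:
            prb.append(p[1])     # step (2): the READ is merely queued
        if not lazy:
            service_prb()        # step (3), done eagerly
    service_prb()                # step (3): every message is eventually removed
    return list(frb)             # step (4), draining the FRB in order


timing_example = [("WRITE", "A", 1), ("READ", "A"), ("WRITE", "A", 2), ("READ", "A")]
print(system2(timing_example, lazy=False))  # [('RTR', 'A', 1), ('RTR', 'A', 2)]
print(system2(timing_example, lazy=True))   # [('RTR', 'A', 2), ('RTR', 'A', 2)]
```

The second output is not in $f_{NDMEM}$ of the timing example, which requires the first RTR to carry the value 1.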


This implementation does not realize $f_{NDMEM}$. In the packet timing graph after
the definition of $f_{NDMEM}$, the first RTR packet might have value 1 or 2 if this implementation is
used. (The second RTR packet will always have data value 2.)

We might like the system to take even more liberty, by performing memory
writes, as well as reads, whenever it wishes. Such an implementation might be as follows:


SYSTEM #3 (purported realization of $f_{NDMEM}$)

(1) When a WRITE packet comes in, put the message WRITE(addr, data)
on the Pending Write Buffer (PWB).

(2) Same as (2) in System #2.

(3) Take messages off the PWB subject to the same restrictions
as before, and write the data into memory.

(4) Same as (3) in System #2, except that there is an additional
restriction that no message may be taken from the PRB if a
message addressing that word is on the PWB.

(5) Same as (4) in System #2.


This too fails to realize $f_{NDMEM}$. However, both System #2 and System #3 do
realize $f_{NDMEM}$ if no WRITE packet is ever sent to the system when any READ/RTR transactions
are in progress on that word. That is, before a WRITE packet is sent, a RTR packet must have
been received for every READ packet sent addressing that word. Fortunately, it is not
difficult to guarantee that this requirement is met: it is simply a nondeterminate functional
specification for the "rest of the world", which we will call the "user".

Definition: The user of a system is that to which the
system connects, and is itself a system. The input ports of
the user are the output ports of the given system, and vice versa.

It would of course be totally useless to require that, in order for a realization
of $f_{NDMEM}$ to work, its user must realize a determinate functional specification. In fact, the
user of a system should have as few restrictions on its behavior as possible. Such
restrictions can generally be specified by requiring that the user realize some nondeterminate
function, just as the system itself does. That is, the difference between system specifications
and user specifications is nothing but a matter of degree of restrictiveness.

The requirement that NDMEM's user not send a WRITE command when any
READ/RTR transactions are in progress can be met by requiring it to realize the following
nondeterminate functional specification $f_{NDMEMUSER}$:

$f_{NDMEMUSER}$

If $Y$ = input history of USER and $X$ = output history
(note the exchange of input and output so that $X$ and $Y$
refer to the same packet streams in both the system and its user),

then $X$ is in $f_{NDMEMUSER}(Y)$ if

(1) $X$ consists only of packets READ(--) and WRITE(--,--), and
(2) For all addr, for any WRITE(addr,--) in $X$, the number of
    READ(addr)'s preceding it in $X$ is $\le$ the number
    of RTR(addr,--)'s in $Y$.


The function $f_{NDMEMUSER}$ is easily seen to be ND-monotonic. This is because the restrictions on the user's output X never become more stringent as Y increases. As Y increases, the proposition "the number of READ(addr)'s preceding it in X is $\le$ the number of RTR(addr,--)'s in Y" never goes from true to false, so the set of legal histories X does not decrease. (If the "$\le$" had been replaced by a relation that can become false as Y grows, such as "$=$", the function would not be ND-monotonic.)

While System #3 does not by itself realize $f_{NDMEM}$, it does realize $f_{NDMEM}$ if connected to a user that realizes $f_{NDMEMUSER}$. To prove this, the important step is to show that each READ(addr) packet generates a RTR packet containing data defined by the most recent WRITE(addr,--) packet preceding the given READ(addr) packet in the input stream.


Let $t_0$ = the instant when the READ(addr) packet comes in. There may be pending WRITE(addr,--) packets in the PWB at $t_0$. If there are none, the most recent WRITE(addr,--) packet in the input stream has already passed out of the PWB and into the memory unit, so its data is in memory word addr. If there are WRITE(addr,--) packets in the PWB at $t_0$, the most recently inserted packet there is the most recent WRITE(addr,--) packet in the input stream. Therefore, letting


$$
D_{addr}(t) =
\begin{cases}
\text{the data in the youngest WRITE(addr,--) packet in the PWB at time } t, & \text{if there is such a packet}\\
\text{the contents of word addr in the memory unit}, & \text{if not,}
\end{cases}
$$


we must show that the data to be eventually returned in a RTR packet is $D_{addr}(t_0)$. Let $t_1$ = the instant when the READ(addr) packet leaves the PRB. First, we show that $D_{addr}(t)$ does not change from $t_0$ to $t_1$. Since the READ(addr) packet has entered the system, it has left the user. Since the corresponding RTR(addr,--) packet has not yet been generated by the system (and won't be until after $t_1$), it has not been received by the user. Therefore, there is a READ/RTR transaction pending on addr, so the user is not sending any WRITE(addr,--) packets. Therefore, whichever WRITE(addr,--) packet in the PWB is youngest will stay youngest as long as it stays in the PWB. So as long as there are any WRITE(addr,--) packets in the PWB, $D_{addr}(t)$ does not change. As long as there are no WRITE(addr,--) packets in the PWB, $D_{addr}(t)$ = the contents of memory, which doesn't change either, because only removal of a WRITE(addr,--) packet from the PWB can change the contents of memory word addr.

There can be no transitions from no WRITE(addr,--) packets in the PWB to one or more packets, because the user is not sending any. The remaining case to consider is the disappearance of the last WRITE(addr,--) packet from the PWB. This packet is clearly the youngest, so $D_{addr}$(just prior to disappearance) = the data in the packet. This data is written into memory by rule 3 of the implementation. $D_{addr}$(just after disappearance) = data written into memory = data in the packet that disappeared. Therefore $D_{addr}(t_0) = D_{addr}(t_1)$.

At time $t_1$, when the READ(addr) packet leaves the PRB, there are no WRITE(addr,--) packets in the PWB, by rule 4 of the implementation. Therefore $D_{addr}(t_0) = D_{addr}(t_1)$ = contents of memory word addr at $t_1$. But when the READ(addr) packet is taken from the PRB, the memory word is read, and its data goes into a RTR(addr,--) packet in the FRB. That packet is therefore RTR(addr, $D_{addr}(t_0)$), and is the packet that will eventually be returned to the user.
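The proof can also be exercised on a small model. The Python sketch below is our own illustration, not the implementation described above, and it fills in details the text leaves open. The PWB is assumed to be a single FIFO. A READ may leave the PRB only when no WRITE to the same word is in the PWB (rule 4). Removing a WRITE from the PWB updates memory (rule 3). Finished RTR packets wait in the FRB, and every timing choice is made by a seeded random number generator. Each READ carries a hypothetical tag so that its RTR can be matched to it. With a user that obeys $f_{NDMEMUSER}$, every RTR should carry the data of the most recent preceding WRITE. With the user's restriction switched off, some interleavings return a later value.

```python
import random
from collections import Counter, deque


def reference(script):
    """Expected RTR data per READ tag: the most recent earlier WRITE to that word (0 if none)."""
    memory, expected = {}, {}
    for pkt in script:
        if pkt[0] == "WRITE":
            memory[pkt[1]] = pkt[2]
        else:
            expected[pkt[2]] = memory.get(pkt[1], 0)
    return expected


class System3:
    """Toy model of System #3 under assumed rules: a FIFO PWB, a PRB, memory and an FRB."""

    def __init__(self):
        self.memory, self.pwb, self.prb, self.frb = {}, deque(), [], []

    def accept(self, pkt):
        (self.pwb if pkt[0] == "WRITE" else self.prb).append(pkt)

    def busy(self):
        return bool(self.pwb or self.prb)

    def step(self, rng):
        """One internal action: retire the oldest WRITE, or serve a READ that rule 4 allows."""
        ready = [r for r in self.prb if all(w[1] != r[1] for w in self.pwb)]
        if self.pwb and (not ready or rng.random() < 0.5):
            _, addr, data = self.pwb.popleft()
            self.memory[addr] = data                  # WRITE leaves the PWB, updates memory
        else:
            read = rng.choice(ready)
            self.prb.remove(read)
            _, addr, tag = read
            self.frb.append((addr, self.memory.get(addr, 0), tag))  # RTR waits in the FRB


def run(script, seed, obey=True):
    """Interleave user, system and delivery actions at random; return RTR data per READ tag."""
    rng = random.Random(seed)
    mem, results = System3(), {}
    reads_sent, rtrs_received = Counter(), Counter()
    i = 0
    while i < len(script) or mem.busy() or mem.frb:
        actions = []
        if i < len(script):
            kind, addr = script[i][0], script[i][1]
            # f_NDMEMUSER (2): WRITE(addr) only when every READ(addr) sent has had its RTR
            if not obey or kind == "READ" or reads_sent[addr] <= rtrs_received[addr]:
                actions.append("send")
        if mem.busy():
            actions.append("step")
        if mem.frb:
            actions.append("deliver")
        action = rng.choice(actions)
        if action == "send":
            mem.accept(script[i])
            if kind == "READ":
                reads_sent[addr] += 1
            i += 1
        elif action == "step":
            mem.step(rng)
        else:
            addr, data, tag = mem.frb.pop(rng.randrange(len(mem.frb)))
            rtrs_received[addr] += 1
            results[tag] = data
    return results


def random_script(rng, length=12, words=3):
    """Random commands: ("WRITE", addr, data) or ("READ", addr, tag) with a unique tag."""
    script = []
    for n in range(length):
        addr = rng.randrange(words)
        if rng.random() < 0.5:
            script.append(("WRITE", addr, rng.randrange(100)))
        else:
            script.append(("READ", addr, f"r{n}"))
    return script


if __name__ == "__main__":
    gen = random.Random(7)
    print(all(run(s, seed) == reference(s)
              for seed, s in ((k, random_script(gen)) for k in range(300))))
    bad = [("WRITE", 0, 1), ("READ", 0, "r0"), ("WRITE", 0, 2)]
    print(any(run(bad, seed, obey=False)["r0"] == 2 for seed in range(100)))
```

The misbehaving script sends the second WRITE while the READ is still pending. Rule 4 then holds the READ until both WRITEs have reached memory, so it can return 2 instead of 1. This is exactly the situation $f_{NDMEMUSER}$ rules out.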


This example demonstrates a general principle:

Whether or not a given implementation of a system realizes a
given function may depend on whether the system's user
realizes some other specific function.


There is no way to get around this fact. There are systems that correctly realize useful functions (even completely determinate functions) when connected to systems that obey certain rules, but behave in a totally pathological way otherwise. Furthermore, the system often can't tell whether the user has broken the rules. In the case of System #3 above, the system would have been able to tell whether a WRITE(addr,--) packet came in while a READ/RTR transaction was pending on word addr, but in some cases the system has no way of knowing whether its user is misbehaving.

The structure controller and packet memory system for a data flow computer is such a system. Perhaps the most important example of the structure controller and memory's dependence on the behavior of their user is the reference count and garbage collection problem. The rules that the user (i.e. the data flow computer) must obey in order to assure correct reference count accounting are as follows:

(1) No pointer to a structure may be duplicated without giving a command to increase the reference count.

(2) No command to decrease the reference count may be given unless a pointer is discarded.

These rules guarantee that the reference count for a node is at least as great as the number of pointers to the node contained anywhere in the computer. (Actually, the rules will be such that the reference count is exactly equal to the number of pointers to the node. However, the penalty for too high a reference count is simply that a useless structure fails to be reclaimed and wastes memory space.)

Now suppose the computer (that is, the structure controller's and memory's user) violates the rules and allows the reference count to become too small. Eventually the reference count may become zero while a pointer to the node still exists somewhere. When the count goes to zero, the memory system reclaims the node and puts it on the list of free nodes.


Two possibilities then arise. If an immediate attempt is made to use the "spurious" pointer to the cell, in a SELECT instruction for example, the structure controller will send a READ command to the memory. The memory will know that this is an illegal command, that is, that the user has violated its specification. It can then signal an appropriate error condition in order to prevent the computation from giving an incorrect result.

If, on the other hand, the cell has been removed from the free storage list and used by the structure controller to build some new structure by the time the spurious pointer is used, there is no way the memory can tell that a violation has occurred. It has no choice but to process the spurious command in the normal way, which results in its referring to a structure that is completely different from what was intended.
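The two possibilities can be illustrated with a toy memory. In the Python sketch below, the class and method names are hypothetical and a node holds only a single value. The memory keeps a reference count per node and puts a node back on its free list when the count reaches zero. A pointer is duplicated without the required increment, so the count reaches zero while a pointer to the node remains. If that pointer is used at once, the memory detects a READ of a free node. If the node has already been reused, the READ silently returns the new structure.

```python
class IllegalCommand(Exception):
    """Raised when a command addresses a node that is on the free list."""


class ToyMemory:
    """Hypothetical structure memory: one data value and one reference count per node."""

    def __init__(self, size):
        self.free = list(range(size))      # list of free nodes
        self.data = [None] * size
        self.refcount = [0] * size

    def allocate(self, value):
        node = self.free.pop(0)            # the structure controller takes a free node
        self.data[node], self.refcount[node] = value, 1
        return node

    def increment(self, node):             # required by rule (1) when a pointer is duplicated
        self.refcount[node] += 1

    def decrement(self, node):             # allowed by rule (2) only when a pointer is discarded
        self.refcount[node] -= 1
        if self.refcount[node] == 0:
            self.data[node] = None
            self.free.append(node)         # count reached zero: node reclaimed

    def read(self, node):
        if node in self.free:              # the detectable kind of violation
            raise IllegalCommand(f"READ of free node {node}")
        return self.data[node]


def misuse(reuse_before_use):
    """Duplicate a pointer without incrementing, then discard the original pointer."""
    mem = ToyMemory(size=1)
    p = mem.allocate("old structure")
    q = p                                  # rule (1) violated: no increment
    mem.decrement(p)                       # count is 0 although q still points at the node
    if reuse_before_use:
        mem.allocate("new structure")      # the same node is reused for a new structure
    try:
        return mem.read(q)
    except IllegalCommand as err:
        return f"error: {err}"


print(misuse(False))   # prints: error: READ of free node 0
print(misuse(True))    # prints: new structure  (no error is possible)
```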

This is not to say that the data flow computer has no way to check for errors in the handling of reference counts. Methods of doing so will be discussed in section 5.0.5.

2.1.2 MUTUAL CONSISTENCY OF FUNCTIONAL REALIZATIONS

Suppose a system realizes $f_S$ contingent on its user realizing $f_T$, which the user does if the original system realizes $f_S$. Does it follow that the realizations actually occur when the two systems are connected to each other? Is it possible that they could both violate their specifications, with each blaming the other? Let the systems be S and T. Each is the other's user.

If any violation does occur, there must be a first instant of violation. That is, there is an instant $t_0$ when it first becomes true that one system (say S) has an output history that does not legally follow from its input history. There is a delay, however slight (even if it is only the delay caused by propagation of electric currents through wires), in the behavior of S. Therefore S's output history at $t_0$ depends on T's output history slightly before $t_0$, at a time when T was not malfunctioning, so S cannot blame its malfunction on T. Even if S and T both malfunction at precisely the same instant, neither S nor T knows about the malfunction of the other at that instant, and so neither malfunction can be excused. It follows that, if both systems conditionally obey their functional specifications, they will obey their specifications in practice.
2.1.3 MONOTONICITY OF FUNCTIONAL SPECIFICATIONS OF THE USER

We now give an example of how not to define the functional specification of a user. Suppose the system MEM has destructive readout, so that it requires that the user rewrite any data that it reads. Suppose further that for some reason the same data must be rewritten, and that this must be done immediately, that is, no other transactions may take place at any address between the read and the rewrite. Here is an attempt at a functional specification for USER. Since USER doesn't know what data to write until it receives the RTR packet, we will require the rewrite to be a consequence of the RTR.


$f_{USER}$

Y = input to user, X = output from user

For all addr and i, if the i-th RTR(addr) exists in Y and is RTR(addr,data),
then the i-th READ(addr) in X is immediately followed in X by WRITE(addr,data)


Unfortunately, this does not require the user to wait for the RTR packet after sending a READ, that is, to send no more packets until the RTR arrives. For example, the user might send

<READ(1); READ(2)>

Until the RTR(1,data) packet comes back, the user has not broken any rules. When the RTR(1,data) does come back, the user will have retroactively broken the rules and will be unable to do anything about it. Since we would like to simplify as much as possible the task of proving that systems obey functional specifications, we need to make the specifications reflect the types of decisions that systems make in practice. It doesn't make sense for a system to perform some operation or emit some result packet on the basis of an input packet not having arrived and not being about to arrive, so $f_{USER}$, as given above, is unreasonable.
The problem is that $f_{USER}$ is not ND-monotonic. To see this, refer to the notation in the definition of ND-monotonicity and let

P = <>, Q = <RTR(1,data)>   [input histories]

X = <READ(1); READ(2)>   [output history]

Now $Q \ge P$, X is in $f_{USER}(P)$ and $f_{USER}(Q)$ is nonempty (containing, for example, <READ(1); WRITE(1,data); READ(2)>), but there is no history in $f_{USER}(Q)$ that is $\ge$ X.

The correct specification for the user is:

If Y = input to user, X = output from user,

For all addr and i, the i-th READ(addr) in X, if it exists, is

$$
\begin{cases}
\text{immediately followed in } X \text{ by WRITE(addr,data)}, & \text{if there is an } i\text{-th RTR(addr,--) in } Y \text{ and it is RTR(addr,data)}\\
\text{last in } X, & \text{if there is no } i\text{-th RTR(addr,--) in } Y
\end{cases}
$$

This is easily seen to be ND-monotonic.
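The two user specifications can be compared mechanically on the example above. The Python sketch below encodes each one as a predicate over finite histories, with packets written as tuples such as ("READ", 1), ("WRITE", 1, "d") and ("RTR", 1, "d"). Two choices are ours, not the text's. First, the rejected specification treats a missing i-th READ(addr) as a violation. Second, extensions of X are enumerated only up to two extra packets drawn from a small fixed alphabet, so the search illustrates the failure of ND-monotonicity rather than proving it.

```python
from itertools import product


def nth(seq, kind, addr, i):
    """Index of the i-th (1-based) packet of this kind and address in seq, or None."""
    hits = [k for k, pkt in enumerate(seq) if pkt[0] == kind and pkt[1] == addr]
    return hits[i - 1] if i <= len(hits) else None


def followed_by(x, r, pkt):
    """True if position r of history x is immediately followed by pkt."""
    return r + 1 < len(x) and x[r + 1] == pkt


def first_spec(y, x):
    """The rejected f_USER: it constrains only READs whose RTR is already in Y."""
    seen = {}
    for rtr in y:                                  # rtr = ("RTR", addr, data)
        addr, data = rtr[1], rtr[2]
        i = seen.get(addr, 0) + 1                  # this is the i-th RTR(addr)
        seen[addr] = i
        r = nth(x, "READ", addr, i)
        if r is None or not followed_by(x, r, ("WRITE", addr, data)):
            return False
    return True


def corrected_spec(y, x):
    """The corrected f_USER: a READ whose RTR is not yet in Y must be last in X."""
    for r, pkt in enumerate(x):
        if pkt[0] != "READ":
            continue
        i = x[:r + 1].count(pkt)                   # this is the i-th READ(addr)
        reply = nth(y, "RTR", pkt[1], i)
        if reply is None:
            if r != len(x) - 1:
                return False
        elif not followed_by(x, r, ("WRITE", pkt[1], y[reply][2])):
            return False
    return True


P, Q = [], [("RTR", 1, "d")]                       # input histories; P is a prefix of Q
X = [("READ", 1), ("READ", 2)]
ALPHABET = [("READ", 1), ("READ", 2), ("WRITE", 1, "d"), ("WRITE", 2, "d")]


def extensions(x, extra=2):
    """X itself and every history obtained by appending up to `extra` packets to it."""
    for n in range(extra + 1):
        for tail in product(ALPHABET, repeat=n):
            yield x + list(tail)


print(first_spec(P, X))                               # True: X is legal before the RTR
print(any(first_spec(Q, e) for e in extensions(X)))  # False: no extension is legal after it
print(corrected_spec(P, X))                           # False: the corrected spec forbids X
```

Under the corrected specification, the user cannot produce X in the first place, because the READ(1) that has no RTR yet must be the last packet sent.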


2.2 PACKET ACKNOWLEDGMENTS AND SAFETY


All of the systems considered so far have had to respond to incoming packets however fast they were sent by their user, and there was no limit to the rate at which the user could send them. In the first implementation of MEM, the memory unit has to accept the commands directly, and hence has to operate at unlimited speed. System #3, implementing NDMEM, seems a slight improvement in that it only has to put the commands into its buffers infinitely quickly, until one realizes that unless the memory unit itself is infinitely fast the buffers have to be infinitely large.

This is clearly unacceptable; no interconnection of speed-independent modules can make such assumptions. The problem is one of safety: no packet may be sent until its destination is ready to receive it. The safety problem arises at several levels in data flow computers. Here we are concerned with it only at its most microscopic level. The solution to the problem is to acknowledge each packet transmission. That is, for each port transmitting data, there is another port transmitting acknowledge packets in the opposite direction. Every data packet must be acknowledged before the next data packet can be sent on the same port. We will require all ports of all systems to have such an acknowledge port.

(Even systems which would be safe without acknowledge ports will have them. This is because of the manner in which packets are transmitted. A packet transmission is indicated by a zero to one transition of a "request" signal. An acknowledge signal from the receiver is needed to tell the transmitter to reset the request signal.)

The implementation of the system MEM may be modified to acknowledge input commands only after the transaction on the actual memory unit is completed. This will make it impossible for the user to send a command while the memory is busy. Of course, the output port must also have acknowledges, since the system to which the RTR packets are sent might be slow and need to be protected against overruns on its input. So the algorithm for AMEM (MEM with acknowledges) might be:


(1) If a WRITE packet is received, update the memory (take your time!) and then send an acknowledge on the input acknowledge port.

(2) If a READ packet is received, fetch data from the memory and send a RTR packet out.

(3) If an acknowledge is received on the output acknowledge port, send an acknowledge on the input acknowledge port.

These three operations proceed concurrently and independently.

Transmission of acknowledge packets is behaviorally similar to transmission of normal packets, and can be handled in the same way in the specification of a system. That is, the acknowledge ports associated with output ports are treated exactly as though they were input ports, and vice-versa. The system AMEM has two input ports, the "real" input port X and the output acknowledge port YA, and two outputs, the "real" output port Y and the input acknowledge port XA.

$f_{AMEM}$

input ports = X, YA; output ports = Y, XA

(1) $|Y|$ = number of READs in X

(2) $Y_i$ = RTR(addr,data), where the i-th READ in X is READ(addr) and the last WRITE(addr,--) before it, if there is one, is WRITE(addr,data), or data = 0 if there is no WRITE(addr,--) before the i-th READ

(3) $|X_A| = |Y_A|$ + number of WRITEs in X

(4) $(X_A)_i$ = "ack"

(5) $|Y_A| \le |Y| \le |Y_A| + 1$

(6) $|X| - 1 \le |X_A| \le |X|$

It is easy to prove that the given implementation realizes parts (1), (2), (3), and (4) of $f_{AMEM}$. (It is very similar to MEM.) Parts (4), (5), and (6) constitute the "Standard Acknowledge Restriction" that we will require all systems and users to obey.


Standard Acknowledge Restriction (S.A.R.) - weak form

If X is an input port and XA is its acknowledge port,

(1) XA consists only of "ack"

(2) $|X_A| \le |X|$

If Y is an output port and YA is its acknowledge port,

(3) $|Y| \le |Y_A| + 1$

Given that a system and its user both obey the weak form of the S.A.R., we can easily show that they obey the following:

Standard Acknowledge Restriction (S.A.R.) - strong form

If Z is an input or output port and ZA is its acknowledge port,

(1) ZA consists only of "ack"

(2) $|Z_A| \le |Z| \le |Z_A| + 1$

Proof: If Z is an input port of the system and an output port of the user, (1) and $|Z_A| \le |Z|$ follow from the S.A.R. on the system (letting Z = X), and $|Z| \le |Z_A| + 1$ follows from the S.A.R. on the user (letting Z = Y). If Z is an output port of the system and an input port of the user, just exchange "system" and "user".

The S.A.R. is clearly ND-monotonic and hence admissible as part of a functional specification.


In any proof that a system realizes a function, it suffices to show that it obeys the weak form of the S.A.R. contingent on its user obeying the strong form.
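Since the S.A.R. only compares packet counts and acknowledge values, it is easy to state as executable predicates. The Python sketch below uses function names of our own, and "nak" is a hypothetical stand-in for any value other than "ack". It checks the weak form separately at the receiving and transmitting ends of one port. It then enumerates short histories to confirm the step used in the proof above: when both ends obey their weak-form conditions, the port obeys the strong form.

```python
from itertools import product


def weak_input(z, za):
    """Weak S.A.R. at the receiving end of a port: only "ack"s, and |ZA| <= |Z|."""
    return all(a == "ack" for a in za) and len(za) <= len(z)


def weak_output(z, za):
    """Weak S.A.R. at the transmitting end: |Z| <= |ZA| + 1."""
    return len(z) <= len(za) + 1


def strong(z, za):
    """Strong S.A.R.: only "ack"s and |ZA| <= |Z| <= |ZA| + 1."""
    return all(a == "ack" for a in za) and len(za) <= len(z) <= len(za) + 1


def weak_ends_imply_strong(max_len=4):
    """Enumerate all short histories of one port and look for a counterexample."""
    for nz in range(max_len + 1):
        z = ["pkt"] * nz
        for nza in range(max_len + 1):
            for za in map(list, product(["ack", "nak"], repeat=nza)):
                if weak_input(z, za) and weak_output(z, za) and not strong(z, za):
                    return False
    return True


print(weak_ends_imply_strong())   # True
```

One end alone is not enough: two packets with no acknowledges satisfy `weak_input` but not `strong`, which is why the proof needs the S.A.R. on both the system and its user.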

We can now prove that AMEM realizes parts (5) and (6) of $f_{AMEM}$, that is, the S.A.R. in strong form.
Let Y = output of AMEM and input to user, X = input to AMEM and output of user.

First,

$$
\begin{aligned}
\text{number of WRITEs in } X &= \text{number of acks sent on } X_A \text{ in consequence of (1) of AMEM's implementation}\\
&= |X_A| - \text{number of acks sent on } X_A \text{ in consequence of (3) of AMEM's implementation}\\
&= |X_A| - |Y_A|
\end{aligned}
$$

Now

$$
\begin{aligned}
|Y| &= \text{number of READs in } X && \text{(by (2) of AMEM's implementation)}\\
&= |X| - \text{number of WRITEs in } X && \text{(by well-behavedness of user)}\\
&= |X| - |X_A| + |Y_A| && \text{(derived above)}\\
&\le |X_A| + 1 - |X_A| + |Y_A| && \text{(from S.A.R. for user)}
\end{aligned}
$$

so $|Y| \le |Y_A| + 1$. Also

$$
\begin{aligned}
|X_A| &= \text{number of WRITEs in } X + |Y_A| && \text{(derived above)}\\
&\le \text{number of WRITEs in } X + |Y| && \text{(from S.A.R. for user)}\\
&= \text{number of WRITEs in } X + \text{number of READs in } X && \text{(by (2) of AMEM's implementation)}\\
&= |X|
\end{aligned}
$$

This proves the weak form of the S.A.R., from which the strong form follows.

2.2.1 CANONICAL PACKET COMMUNICATION

Since the Standard Acknowledge Restriction narrowly limits the way acknowledge ports are handled in the functional specification of a system, it is not uncommon for the handling of the acknowledge ports to be similarly limited in the implementation of the system. Wherever possible, system implementations will receive and transmit packets in the following way:
Canonical Packet Reception (RCVPKT)

(1) Wait until a packet has arrived on the input port (it might have already arrived by the time this step is executed); take its data

(2) Send an acknowledge for it

Canonical Packet Transmission (XMTPKT)

(1) Send the packet

(2) Wait for an acknowledge


These operations will appear in the system implementation language as "functions" that take port names as arguments and appear in assignment statements. The data conveyed by the assignment is the contents of the packet. Assignment statements containing these operations are like input/output operations in ordinary computer programs in that they "hang up" the program until the packet communication has taken place. "var := RCVPKT(port)" waits until an incoming packet has arrived (and then acknowledges it). "XMTPKT(port) := expression" waits until the transmitted packet has been acknowledged. Programs may use multiprocessing as long as no RCVPKT or XMTPKT operations can be simultaneously executed by two processes on the same port.

It is easy to see that any implementation using the RCVPKT and XMTPKT operations obeys the Standard Acknowledge Restriction.

Systems need not use these canonical operations in order to be correct. For example, the implementation of AMEM given previously did not. That is why the proof that it obeyed the Standard Acknowledge Restriction was so complicated.

Here is an implementation of CMEM, a system whose behavior is similar (but not identical) to that of AMEM:
process starts at A
input port X
output port Y
array memory init 0
var command, addr, data

A: command := RCVPKT(X);
   if command = READ(--) then
      let command = READ(addr);
      data := memory(addr);
      XMTPKT(Y) := RTR(addr,data)
   else
      let command = WRITE(addr,data);
      memory(addr) := data;
   goto A
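The canonical operations behave like blocking operations on a one-packet channel. The Python sketch below models each port as a `threading.Condition` guarding a single packet slot. This is an assumption made for illustration, not the hardware request/acknowledge signalling described earlier. `xmtpkt` deposits a packet and blocks until it has been taken, and `rcvpkt` waits for a packet, takes it and acknowledges it in one step. CMEM is then a direct transcription of the program above, except for a hypothetical "STOP" command added only so that the demonstration thread can exit. The counters record $|Z|$ and $|Z_A|$ for each port, so the S.A.R. can be checked afterwards.

```python
import threading


class Port:
    """A one-packet channel plus its acknowledge wire (a model, not real hardware)."""

    def __init__(self):
        self._cv = threading.Condition()
        self._packet, self._full = None, False
        self.sent = self.acks = 0                 # |Z| and |ZA|, for checking the S.A.R.

    def xmtpkt(self, packet):
        """XMTPKT(port) := packet: send the packet, then wait for its acknowledge."""
        with self._cv:
            self._packet, self._full = packet, True
            self.sent += 1
            self._cv.notify_all()
            while self._full:
                self._cv.wait()

    def rcvpkt(self):
        """var := RCVPKT(port): wait for a packet, take its data, acknowledge it."""
        with self._cv:
            while not self._full:
                self._cv.wait()
            packet, self._full = self._packet, False
            self.acks += 1
            self._cv.notify_all()
            return packet


def cmem(x, y, size):
    """Transcription of the CMEM program; "STOP" is a hypothetical packet for the demo."""
    memory = [0] * size                           # array memory init 0
    while True:
        command = x.rcvpkt()
        if command[0] == "READ":
            addr = command[1]
            y.xmtpkt(("RTR", addr, memory[addr]))
        elif command[0] == "WRITE":
            _, addr, data = command
            memory[addr] = data
        else:
            return


def demo():
    x, y = Port(), Port()
    server = threading.Thread(target=cmem, args=(x, y, 4), daemon=True)
    server.start()
    replies = []
    for command in [("WRITE", 1, 42), ("READ", 1), ("READ", 2)]:
        x.xmtpkt(command)                         # returns once CMEM has acknowledged it
        if command[0] == "READ":
            replies.append(y.rcvpkt())
    x.xmtpkt(("STOP",))
    server.join()
    return replies, (x.sent, x.acks), (y.sent, y.acks)


print(demo())   # ([('RTR', 1, 42), ('RTR', 2, 0)], (4, 4), (2, 2))
```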
2.3 LATENCY


CMEM and AMEM behave differently in a subtle way. Suppose the user transmits a READ packet and then refuses to acknowledge the RTR packet that results. AMEM refuses to acknowledge the original READ, and the entire system comes to a halt, since the user can't send another command packet until the previous one is acknowledged. CMEM acknowledges the READ packet anyway (this happens automatically as part of the RCVPKT operation). It then refuses to acknowledge any further command packets until the RTR is acknowledged, because it gets hung up in the statement "XMTPKT(Y) := RTR(addr,data)". CMEM behaves as though it has an input buffer capable of storing one packet.

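This difference shows up in the functional specification. Lines 2, 4, 5, and 6 of the specification of $f_{AMEM}$ in section 2.2 apply to CMEM also. Lines 1 and 3 are different:

$f_{AMEM}$

(1) $|Y|$ = number of READs in X

(3) $|X_A| = |Y_A| + |X|$ - number of READs in X

$f_{CMEM}$

(1)
$$
|Y| =
\begin{cases}
\text{number of READs in } X & \text{if } |X| = 0 \text{ or } 1, \text{ or } \bigl(|X| \ge 2 \text{ and } |Y_A| \ge \text{number of READs in } (X - \text{last packet})\bigr)\\
\text{number of READs in } (X - \text{last packet}) & \text{otherwise}
\end{cases}
$$

(3)
$$
|X_A| =
\begin{cases}
|X| & \text{if } |X| = 0 \text{ or } 1, \text{ or } \bigl(|X| \ge 2 \text{ and } |Y_A| \ge \text{number of READs in } (X - \text{last packet})\bigr)\\
|X| - 1 & \text{otherwise}
\end{cases}
$$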


This illustrates the fact that correct analysis of the latency of a system can be quite complicated and requires careful analysis of the algorithm.
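The case analysis for $f_{CMEM}$ can be checked by exhaustive enumeration. The Python sketch below is a sequential counting model of our own. It ignores addresses and data and assumes a user that obeys the S.A.R., sends commands from a fixed script, and acknowledges only the first `limit` RTR packets. The model reports the packet counts once no further progress is possible. It then compares those counts with lines (1) and (3) of $f_{AMEM}$ and $f_{CMEM}$, and with lines (5) and (6), for every script of up to six commands.

```python
from itertools import product


def run_amem(script, limit):
    """Counts (|X|, |XA|, |Y|, |YA|) once AMEM and its user can make no further progress."""
    x = xa = y = ya = 0
    for cmd in script:
        x += 1                     # the previous command was acknowledged, so send the next
        if cmd == "WRITE":
            xa += 1                # rule (1): acknowledge after updating memory
        else:
            y += 1                 # rule (2): send the RTR packet
            if ya == limit:
                break              # RTR never acknowledged, so neither is the READ
            ya += 1
            xa += 1                # rule (3): output acknowledge -> input acknowledge
    return x, xa, y, ya


def run_cmem(script, limit):
    """Same counts for CMEM, which acknowledges each command as part of RCVPKT."""
    x = xa = y = ya = 0
    for cmd in script:
        x += 1
        xa += 1
        if cmd == "READ":
            y += 1                 # XMTPKT(Y) := RTR(addr,data)
            if ya == limit:        # CMEM hangs; its user may still send one more command
                x = min(len(script), x + 1)
                break
            ya += 1
    return x, xa, y, ya


def amem_lines(X, ya):
    """Lines (1) and (3) of f_AMEM, as (|Y|, |XA|)."""
    return X.count("READ"), ya + len(X) - X.count("READ")


def cmem_lines(X, ya):
    """Lines (1) and (3) of f_CMEM, as (|Y|, |XA|)."""
    head = X[:-1].count("READ")            # number of READs in (X - last packet)
    if len(X) <= 1 or ya >= head:
        return X.count("READ"), len(X)
    return head, len(X) - 1


def check(max_len=6):
    for n in range(max_len + 1):
        for script in map(list, product(["READ", "WRITE"], repeat=n)):
            for limit in range(n + 1):
                for sim, spec in ((run_amem, amem_lines), (run_cmem, cmem_lines)):
                    x, xa, y, ya = sim(script, limit)
                    if (y, xa) != spec(script[:x], ya):
                        return False
                    if not (ya <= y <= ya + 1 and x - 1 <= xa <= x):   # lines (5), (6)
                        return False
    return True


print(check())                            # True
print(run_amem(["READ", "WRITE"], 0))     # (1, 0, 1, 0)
print(run_cmem(["READ", "WRITE"], 0))     # (2, 1, 1, 0)
```

The last two lines show the behavior described at the start of this section. With the first RTR unacknowledged, AMEM leaves the READ unacknowledged, so the user is stuck. CMEM acknowledges the READ, so the user can send one further command, which CMEM does not acknowledge.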
The only difference between AMEM and CMEM arises if the user fails to acknowledge all RTR packets, that is, if $|Y_A| \ne |Y|$. If $|Y_A| = |Y|$, one can easily show that, for both AMEM and CMEM,

$$
|Y| = \text{number of READs in } X, \qquad |X_A| = |X|
$$

(To prove this for CMEM, show that if $|X| \ge 2$, the case $|Y_A| <$ number of READs in (X - last packet) can't occur.)

The latency of a system is the number of commands that it can accept and acknowledge whose results have not been acknowledged by the user; that is, the number of pending commands that it can "remember". Because systems are so varied in their behavior, the concept of latency is not easy to define precisely.


One system for which it can be defined is the FIFO, or first-in-first-out buffer. A FIFO of length N (and having latency N) is a system with one input port and one output port, which realizes the identity function and acknowledges up to N more inputs than its user has acknowledged outputs. The function realized by a FIFO of length N is:

$$
Y_i = X_i \ \ (1 \le i \le |Y|), \qquad |X_A| = \min\bigl(|X|,\ |Y_A| + N\bigr)
$$

It may be implemented by the following program:


processes start at A, B
input port X
output port Y
var k, m
var p init 0   | queue population

A: until p ≠ N do;
   k := RCVPKT(X);
   store k at end of queue;
   p := p + 1;
   goto A;

B: until p ≠ 0 do;
   m := item taken from front of queue;
   XMTPKT(Y) := m;
   p := p - 1;
   goto B

For N = 1 this becomes:

process starts at A
input port X
output port Y
var p

A: p := RCVPKT(X);
   XMTPKT(Y) := p;
   goto A

A FIFO of latency zero cannot be implemented by any system using the RCVPKT and XMTPKT operations, though it can be implemented with a few pieces of wire.

Appendix I contains a proof that a series connection of FIFOs of lengths M and N yields a FIFO of length M+N.
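The FIFO program can be modelled as a pair of cooperating state machines. In the Python sketch below, the `Port` and `Fifo` classes are our own names and a seeded random scheduler stands in for arbitrary timing. The producer obeys the S.A.R., and the consumer acknowledges only its first `limit` outputs. When no further action is possible, the check confirms three things: packets come out unchanged and in order, $|X_A| = \min(|X|, |Y_A| + N)$, and $|Y| = \min(|X|, |Y_A| + 1)$. The last relation is one we derive from the program for $N \ge 1$. Applying the same check to a chain of FIFOs, with N replaced by the sum of their lengths, is a quick empirical test of the Appendix I result, not a substitute for its proof.

```python
import random
from collections import deque


class Port:
    """One packet slot with counters: sent = |Z|, acked = |ZA|."""

    def __init__(self):
        self.slot, self.sent, self.acked = None, 0, 0

    def offer(self, item):                 # transmitter places a packet on the port
        self.slot, self.sent = item, self.sent + 1

    def take(self):                        # receiver takes the packet and acknowledges it
        item, self.slot, self.acked = self.slot, None, self.acked + 1
        return item


class Fifo:
    """Processes A and B of the FIFO program, sharing a queue and the population p."""

    def __init__(self, n, inp, out):
        self.n, self.inp, self.out = n, inp, out
        self.queue, self.p, self.in_flight = deque(), 0, False

    def enabled(self):
        acts = []
        if self.p != self.n and self.inp.slot is not None:
            acts.append(self.accept)
        if not self.in_flight and self.queue:
            acts.append(self.emit)
        if self.in_flight and self.out.slot is None:
            acts.append(self.finish)
        return acts

    def accept(self):                      # A: k := RCVPKT(X); store k; p := p + 1
        self.queue.append(self.inp.take())
        self.p += 1

    def emit(self):                        # B: m := front of queue; XMTPKT(Y) := m (sent)
        self.out.offer(self.queue.popleft())
        self.in_flight = True

    def finish(self):                      # B: acknowledge received; p := p - 1
        self.in_flight = False
        self.p -= 1


def simulate(lengths, items, limit, seed):
    """Chain FIFOs of the given lengths; the consumer acknowledges only `limit` outputs."""
    rng = random.Random(seed)
    ports = [Port() for _ in range(len(lengths) + 1)]
    fifos = [Fifo(n, ports[k], ports[k + 1]) for k, n in enumerate(lengths)]
    x, y, delivered, nxt = ports[0], ports[-1], [], 0
    while True:
        acts = [a for f in fifos for a in f.enabled()]
        if nxt < len(items) and x.slot is None:
            acts.append("produce")         # S.A.R.: previous input was acknowledged
        if y.slot is not None and y.acked < limit:
            acts.append("consume")
        if not acts:
            return x, y, delivered
        act = rng.choice(acts)
        if act == "produce":
            x.offer(items[nxt])
            nxt += 1
        elif act == "consume":
            delivered.append(y.take())
        else:
            act()


def check(lengths, items, limit, seed):
    x, y, delivered = simulate(lengths, items, limit, seed)
    total = sum(lengths)
    return (delivered == items[:len(delivered)]          # identity, order preserved
            and y.sent == min(x.sent, y.acked + 1)
            and x.acked == min(x.sent, y.acked + total))


print(all(check(lengths, list(range(10)), limit, seed)
          for lengths in ([1], [3], [2, 1], [1, 4])
          for limit in range(8) for seed in range(20)))    # True
```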
When systems differ only in their latency, it is sometimes possible to make them equivalent by adding FIFOs to various ports. For example, it can be shown that CMEM is identical to AMEM with a FIFO of length one on its input. If it could be shown that every system X is equivalent, except for latency, to a system $X_0$ defined as having latency zero, then the latency of the system X could be characterized by the lengths of the FIFOs that would have to be added to the various ports of $X_0$ to make it identical to X. A system of latency zero would have to be one which never acknowledges any input packet until all resulting output packets have been sent and acknowledged. AMEM is such a system, so CMEM could be said to have latency 1 on its input port and zero on its output port. It is not clear whether such an analysis can be applied to nondeterminate systems of significant complexity.

2.3.1 ARBITRATORS, DISTRIBUTORS, AND ALLOCATORS


Three basic systems are very important in the design of the structure controller and memory, as well as in other places in data flow computers.

The arbitrator is a nondeterminate system with N inputs and one output, which transmits each incoming packet to the output. The order of the packets from each input must be preserved in the output stream. The order in the output stream of packets from different ports is arbitrary; in any reasonable implementation it would depend on which input packet arrived first. An arbitrator realizes the following function, in which the port number is indicated by a superscript instead of a subscript:


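basic (zero latency) arbitrator $f_{ARB}$

If $X^1, X^2, \ldots, X^N$ are inputs and Y is output,

$(Y, X^1_A, \ldots, X^N_A) \in f_{ARB}(X^1, \ldots, X^N, Y_A)$ if

(1) $|Y| = \min\Bigl(\sum_{i=1}^{N} |X^i|,\ |Y_A| + 1\Bigr)$

(2) $\forall i \in [1,N]$, $|X^i_A|$ = number of packets from $X^i$ in the first $|Y_A|$ packets of Y

(3) $\forall i \in [1,N]$, if $k(i)$ is the number of packets from $X^i$ in Y, the sequence $\langle i, X^i_1\rangle, \langle i, X^i_2\rangle, \ldots, \langle i, X^i_{k(i)}\rangle$ is a subsequence of Y.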
Each incoming packet is tagged with its port number so that its source can be identified in the output. This identification feature is used in a few, but not all, applications of the arbitrator.


Arbitrators are the major component of the arbitration network of the data flow computer. The principal use of the arbitrator in the structure memory is to allow the address space to be divided into small pieces, with a separate memory module handling transactions on each piece. The LOAD packets sent back from the several modules are merged in an arbitrator, so that the entire interconnection of modules behaves as if it were one memory system.

Arbitrators of nonzero latency may be defined as zero latency arbitrators with various FIFO buffers on the ports. Such arbitrators are useful in various places throughout the data flow computer, but there is one place where the arbitrator must have latency zero: the transmission of packets from the structure controller to the memory. When the structure controller receives an acknowledge for a packet it has sent to the memory, it must know that that packet is ahead of any other packets that might subsequently be sent to other input ports of the arbitrator on that memory unit. This problem will be explained in section 5.0.4.


An arbitrator of zero latency may be realized by the following program:

process starts at A
input ports X1 ... XN
output port Y
var p, input

A: wait until a packet is available on any input port;
   let p := that port;               | this is nondeterminate!
   input := the packet on port p;    | do not acknowledge yet
   XMTPKT(Y) := <p, input>;
   send acknowledge on port p;
   goto A
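A sequential model makes the three lines of $f_{ARB}$ easy to test. In the Python sketch below, which is our own illustration, each input port $i$ has a fixed list of packets to send and the sources obey the S.A.R. The nondeterminate choice of port is made by a seeded random number generator, and the user acknowledges only its first `limit` output packets. After the arbitrator stops, the check evaluates the three lines on the resulting histories. For line (3), it compares the packets from each port that appear in Y with the beginning of that port's stream.

```python
import random
from collections import Counter


def arbitrate(streams, limit, rng):
    """Zero-latency arbitrator; input ports are the keys of `streams` (numbered 1..N).

    Each step forwards <p, packet> on Y, waits for the user's acknowledge, and only
    then acknowledges input port p. The user acknowledges only `limit` outputs."""
    acked = Counter()                         # |X^i_A| for each input port i
    y, ya = [], 0
    while True:
        ready = [i for i in streams if acked[i] < len(streams[i])]
        if not ready:
            return y, ya, acked
        p = rng.choice(ready)                 # nondeterminate choice of input port
        y.append((p, streams[p][acked[p]]))   # XMTPKT(Y) := <p, input>
        if ya == limit:
            return y, ya, acked               # acknowledge withheld: arbitrator hangs
        ya += 1
        acked[p] += 1                         # send acknowledge on port p


def meets_spec(streams, limit, seed):
    """Evaluate lines (1)-(3) of f_ARB on the final histories."""
    y, ya, acked = arbitrate(streams, limit, random.Random(seed))
    # |X^i|: acknowledged packets plus at most one more that the source has already sent
    sent = {i: min(len(s), acked[i] + 1) for i, s in streams.items()}
    ok = len(y) == min(sum(sent.values()), ya + 1)                       # (1)
    for i, s in streams.items():
        ok &= acked[i] == sum(1 for j, _ in y[:ya] if j == i)            # (2)
        from_i = [pkt for j, pkt in y if j == i]
        ok &= from_i == s[:len(from_i)]                                  # (3)
    return ok


streams = {1: ["a1", "a2", "a3"], 2: ["b1", "b2"], 3: []}
print(all(meets_spec(streams, limit, seed)
          for limit in range(7) for seed in range(50)))                  # True
```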
A distributor is a determinate system with one input and N outputs, which transmits incoming packets to the output port selected by a data field in the packet. Incoming packets are assumed to be of the form <port, data>. The distributor strips off the "port" field in the final result. An N-output distributor realizes the following function:


basic (zero latency) distributor $f_{DIST}$

If X is input and $Y^1, Y^2, \ldots, Y^N$ are outputs,

$(Y^1, Y^2, \ldots, Y^N, X_A) \in f_{DIST}(X, Y^1_A, Y^2_A, \ldots, Y^N_A)$ if

(1) $\forall i \in [1,N]$, $|Y^i|$ = number of packets <i,--> in X

(2) $|X_A| = \sum_{i=1}^{N} |Y^i_A|$

(3) $\forall i\ \forall j$, $Y^i_j$ = data, where the j-th packet <i,--> in X is <i, data>

Such a distributor may be implemented as follows:

process starts at A
input port X
output ports Y1 ... YN

A: wait until a packet is available on port X;
   z := the packet on port X;    | do not acknowledge yet
   let z = <port, data>;
   XMTPKT(Y^port) := data;
   send acknowledge on port XA;
   goto A
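The distributor can be modelled in the same way. In the Python sketch below, which is our own illustration, the user sends a fixed list of <port, data> packets under the S.A.R. The user attached to each output $Y^i$ acknowledges only `limits[i]` packets. This is a modelling assumption that lets the distributor stall at different points. Because the distributor has zero latency, it acknowledges X only after the selected output has been acknowledged. The check then evaluates lines (1) to (3) of the distributor specification.

```python
def distribute(x, n, limits):
    """Zero-latency distributor with output ports 1..n.

    x is the list of <port, data> packets the user wants to send; limits[i] is how many
    packets the user on Y^i acknowledges (a modelling assumption)."""
    y = {i: [] for i in range(1, n + 1)}
    ya = {i: 0 for i in range(1, n + 1)}
    xa = 0
    for port, data in x:             # the user sends a packet only after the previous ack
        y[port].append(data)         # XMTPKT(Y^port) := data
        if ya[port] == limits[port]:
            break                    # waits forever for Y^port's acknowledge
        ya[port] += 1
        xa += 1                      # send acknowledge on port XA
    return y, ya, xa


def meets_spec(x, n, limits):
    """Evaluate lines (1)-(3) of the distributor specification on the final histories."""
    y, ya, xa = distribute(x, n, limits)
    X = x[:min(len(x), xa + 1)]      # input history: acknowledged packets plus one pending
    ok = xa == sum(ya.values())                                          # (2)
    for i in range(1, n + 1):
        ok &= len(y[i]) == sum(1 for p, _ in X if p == i)                # (1)
        ok &= y[i] == [d for p, d in X if p == i]                        # (3)
    return ok


packets = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]
print(all(meets_spec(packets, 3, {1: a, 2: b, 3: c})
          for a in range(3) for b in range(3) for c in range(2)))        # True
```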

Higher latency distributors may be defined in terms of basic distributors and FIFO buffers.
Distributors are the principal component of the distribution network of the data flow computer.


An allocator is a nondeterminate variation of a distributor which transmits incoming packets to one of several output ports. Each packet is sent to any output port that is ready to receive it, that is, any port that has acknowledged all previous packets sent to it. An allocator is normally used to send packets to a group of identical units, always selecting a unit that is not busy. The structure controller of a data flow computer will typically be realized in the form of several identical units in order to increase throughput. Operation packets from the instruction cells will be sent through allocators to the structure control units. (In fact, the other functional units of a data flow computer will be handled the same way.) An N-output allocator realizes the following function:

basic (minimal latency) allocator $f_{ALLOC}$

If X is input and $Y^1, Y^2, \ldots, Y^N$ are outputs,

$(Y^1, Y^2, \ldots, Y^N, X_A) \in f_{ALLOC}(X, Y^1_A, Y^2_A, \ldots, Y^N_A)$ if

(1) $\sum_{i=1}^{N} |Y^i| = |X|$

(2) $|X_A| = \min\Bigl\{|X|,\ N - 1 + \sum_{k=1}^{N} |Y^k_A|\Bigr\}$

(3) $Y^1, Y^2, \ldots, Y^N$ are disjoint subsequences of X

It may be implemented by the following program:

processes start at A, B
input port X
output ports Y1 ... YN
queue q size N init (1, 2, ..., N)
var pop init N

A: wait until a packet is available on port X;   | do not acknowledge yet
   Z := the packet on port X;
   K := item at head of q;   | K is removed from q
   pop := pop - 1;
   send packet Z on port YK;   | don't wait for acknowledge
   until pop ≠ 0 do;
   send acknowledge on port XA;
   goto A;

B: wait until acknowledge is available on any port Yp;   | nondeterminate!
   let p = that port;
   take the acknowledge from port Yp;
   put p at end of q;
   pop := pop + 1;
   goto B
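The Python sketch below is a sequential, single-threaded model of processes A and B, written only to show how the queue of free ports q and the counter pop (here simply the queue length) decide when X is acknowledged. The class and method names are hypothetical; real hardware would run the two processes concurrently.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class Allocator:
    """Sequential model of the basic allocator's processes A and B.

    offer() stands for a packet arriving on X; acknowledge(p) stands for an
    acknowledge arriving from output port Y_p.
    """

    def __init__(self, n: int) -> None:
        self.free: Deque[int] = deque(range(1, n + 1))  # queue q, init (1, ..., N)
        self.outputs: Dict[int, List[Any]] = {i: [] for i in range(1, n + 1)}
        self.x_acks = 0          # acknowledges sent on X^A so far
        self.waiting = False     # process A is blocked in "until pop != 0"

    def offer(self, packet: Any) -> None:
        """Process A: take a packet from X and send it to the port at the head of q."""
        if self.waiting:
            raise RuntimeError("previous packet on X has not been acknowledged yet")
        k = self.free.popleft()            # K := head of q; pop := pop - 1
        self.outputs[k].append(packet)     # send on Y_K without waiting
        self._try_ack_x()

    def acknowledge(self, port: int) -> None:
        """Process B: Y_p acknowledged its packet, so p rejoins the end of q."""
        self.free.append(port)             # pop := pop + 1
        if self.waiting:
            self._try_ack_x()

    def _try_ack_x(self) -> None:
        # pop is len(self.free); acknowledge X only when some port is free.
        if self.free:
            self.x_acks += 1
            self.waiting = False
        else:
            self.waiting = True
```

With N = 2, offering three packets and acknowledging one of them yields 2 acknowledges on X, matching condition (2): min{3, 2 - 1 + 1} = 2.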

The basic allocator given above does not have latency zero in the sense of not
acknowledging any input until the resultant output has been acknowledged; such an
arrangement would defeat the allocator's purpose. The system given above does have the
minimum latency that makes sense.


3.0  THE  BASIC  MEMORY  MODULE 


In this section a formal specification of the memory module "MM" will be given.
MM is the fundamental building block of the packet memory system. Each MM system is a
memory, somewhat like the system NDMEM described earlier, which handles a specific set of
addresses. To increase the total information transfer rate, the address space of the entire packet
memory system may be divided into smaller pieces, with one MM unit handling each piece.
The MM units are connected through arbitrators and distributors, and form a system which is
itself an MM. This is "horizontal" composition, and is quite similar to the interleaving found in
conventional memory systems. To increase the speed of individual transactions, an MM unit
may have a cache module "CM" connected to it. An MM with a CM connected to it is itself an MM.
This is "vertical" composition, and is quite similar to the cache memories found in high
performance conventional computers.


MM has one input port CMDI ("command in") taking command packets from its
user, and one output port RESO ("result out") returning results to the user. The memory
space is divided into words or cells (the terms will be used interchangeably), each of which
corresponds to one node of a structure. Every memory transaction refers to one word, and
every incoming or outgoing packet bears the address of that word in its address field. The
memory space is the same size as the address space, and the size is known to the user, so
there can be no "nonexistent memory word" error. In most implementations, the memory size
would be 2^N, where the address field of every packet is N bits.

Notation: FET(±) means any of FET, FET−, or FET+. LOAD(±) similarly.


Each word in the memory contains a data field and a reference count field,
which are used by the structure controller as described in section 1.2. LOAD(±) and UPD
packets have corresponding fields. Furthermore, FET(±) packets have a tag field, which is
returned unchanged in the corresponding LOAD(±) packet.
3.0.1 LATENCY AND INITIAL MEMORY CONTENTS

The specification of MM to be given below does not say anything about latency.
This is because MM's user is required to acknowledge every result packet. When this
happens, MM will acknowledge every command packet, regardless of what its actual latency is.
Hence, an accurate description of MM's latency is unnecessary.

Initial memory contents will also be left unspecified. In the functional
specification of a memory, the definition of initial contents arises in the specification of the
system's response to a READ command that was not preceded by a WRITE. The specification
of MM will assume that this does not occur. In an actual data flow computer, a free storage
list will be generated when the system starts, which requires writing on every cell.

3.0.2 INFORMAL BEHAVIOR OF MM


There are 5 types of input packets to MM, and 4 types of output packets:

FET(addr, tag)          ("fetch") reads the addressed word and returns
                        LOAD(addr, data, ref, tag)
                        ["ref" is the reference count]

FET+(addr, tag)         increases the reference count by one and returns
                        LOAD+(addr, data, ref, tag)
                        ["ref" is the reference count after the increment]

FET−(addr, tag)         decreases the reference count by one and returns
                        LOAD−(addr, data, ref, tag)

CLR(addr)               ("clear") waits until all FET/LOAD, FET+/LOAD+, and
                        FET−/LOAD− transactions on the indicated word have
                        completed, and then returns DONE(addr)

UPD(addr, data, ref)    ("update") writes the data and reference count
                        into the addressed word. It returns no result,
                        and hence uses no tag.

MM is nondeterminate, as was the example memory NDMEM, in that result
packets referring to different cells are not constrained to appear in the same order as the
commands that gave rise to them. MM is further nondeterminate in that it may rearrange
LOAD(±) packets referring to the same cell. Such nondeterminacy would not have made sense
for NDMEM, since RTR packets with the same data and same address were indistinguishable,
but in the case of MM, LOAD(±) packets may have different tags.

Since LOAD(±) packets involve a change of reference count and may be
reordered arbitrarily, the question arises: what happens to the reference counts appearing in
such packets if they are reordered? The answer is that the result packets have reference
counts consistent with their own order, not the order of the original command packets.

Example: Suppose the reference count of cell A is 1, and the command sequence

FET+(A, T1); FET+(A, T2); FET−(A, T3); FET−(A, T4)

is sent. Some of the possible results are

LOAD+(A, D, 2, T1); LOAD+(A, D, 3, T2); LOAD−(A, D, 2, T3); LOAD−(A, D, 1, T4)

or

LOAD−(A, D, 0, T3); LOAD−(A, D, −1, T4); LOAD+(A, D, 0, T1); LOAD+(A, D, 1, T2)

In the second case, the reference count temporarily becomes negative!

The reference count appearing in any LOAD+ packet is one more than the count
in the preceding LOAD(±) packet for the same word. Similarly, the count in a LOAD− is one less
than, and the count in a LOAD is equal to, the count in the preceding LOAD(±). Some
implementations of MM will never reorder LOAD(±) packets referring to the same address,
although they may reorder those for different addresses. If this is the case, the reference count
will never become negative, which removes the need for a sign bit in the reference count field.
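A small Python sketch of this rule follows. It walks the result packets for one word in output order and assigns each the count implied by that order. The function name and the `(type, tag)` tuple encoding are assumptions of the sketch, and ASCII "LOAD-" stands for LOAD−. Applied to the two orderings in the example above, it reproduces the counts 2, 3, 2, 1 and 0, −1, 0, 1.

```python
from typing import List, Tuple

# Change in reference count implied by each result type ("LOAD-" stands for LOAD−).
DELTA = {"LOAD": 0, "LOAD+": +1, "LOAD-": -1}


def assign_ref_counts(initial: int,
                      results: List[Tuple[str, str]]) -> List[Tuple[str, int, str]]:
    """Attach to each (type, tag) result the count consistent with its output order."""
    ref = initial
    annotated = []
    for kind, tag in results:
        ref += DELTA[kind]
        annotated.append((kind, ref, tag))
    return annotated


second = [("LOAD-", "T3"), ("LOAD-", "T4"), ("LOAD+", "T1"), ("LOAD+", "T2")]
print([ref for _, ref, _ in assign_ref_counts(1, second)])  # [0, -1, 0, 1]
```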


3.0.3 INFORMAL BEHAVIOR OF MM'S USER

When  the  user  gives  a CLR  command,  it  must  not  send  any  further  commands  of 
any  type  for  the  indicated  cell,  until  the  corresponding  DONE  packet  has  returned.  (The 
purpose  of  the  CLR  command  is  to  clear  out  pending  transactions.  It  would  defeat  its  purpose 
to  continue  sending  commands.) 

Like  NDMEM,  MM  requires  that  no  UPD  command  be  given  while  any 
transactions  are  pending  on  the  indicated  cell. 

3.0.4 FORMAL DEFINITION OF MM AND MMUSER

These definitions do not show latency or make any reference to acknowledges.
The user is required to acknowledge every result packet, and MM is consequently required to
acknowledge every command. Both systems of course obey the Standard Acknowledge
Restriction. The definitions do not consider the possibility of illegal packet types or invalid
fields in packets. All universal quantifiers are intended to range over a set that is in each
case obvious from context.

Note: In rules 2, 3, and 4, the zeroth DONE in Y means the beginning of Y. The
(N+1)-th DONE in Y, where N = the number of DONEs in Y, means the end of Y. Similarly for CLRs
in X. The intention is to let the DONE and CLR packets break up X and Y into intervals, which
makes it convenient to think of the entire histories as being preceded and followed by DONE
or CLR packets.
If X is input and Y is output, Y ∈ fMM(X) if

(1) For all addr, the number of DONE(addr) packets in Y = the number of CLR(addr)
    packets in X

(2) For all addr, K, and tag, the number of LOAD(addr,--,--,tag) packets between the K-th
    and (K+1)-th DONE(addr) in Y = the number of FET(addr,tag) packets between the K-th
    and (K+1)-th CLR(addr) in X

(3) For all addr, K, and tag, the number of LOAD−(addr,--,--,tag) packets between the
    K-th and (K+1)-th DONE(addr) in Y = the number of FET−(addr,tag) packets between the
    K-th and (K+1)-th CLR(addr) in X

(4) For all addr, K, and tag, the number of LOAD+(addr,--,--,tag) packets between the
    K-th and (K+1)-th DONE(addr) in Y = the number of FET+(addr,tag) packets between the
    K-th and (K+1)-th CLR(addr) in X

(5) For all addr, J, and K, the J-th LOAD(±)(addr,--,--,--) in Y is
    LOAD(±)(addr, data, ref+D, --), where the last UPD(addr,--,--) before the J-th
    FET(±)(addr,--) in X is UPD(addr, data, ref) and is preceded by I FET(±)(addr,--)
    packets, and D = (number of LOAD+(addr,--,--,--) packets) − (number of
    LOAD−(addr,--,--,--) packets) among the (I+1)-th to J-th LOAD(±)(addr,--,--,--) packets in Y.
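To make the interval structure of rules (1) through (4) concrete, here is a hedged Python sketch of a checker over complete histories X and Y. The tuple encodings (`("FET+", addr, tag)` for commands, `("LOAD+", addr, data, ref, tag)` for results) and the function names are assumptions of the sketch, ASCII "-" stands for the minus sign, and rule (5) on reference counts is not checked.

```python
from collections import Counter
from typing import List, Tuple

FET_TO_LOAD = {"FET": "LOAD", "FET+": "LOAD+", "FET-": "LOAD-"}


def intervals(history: List[Tuple], addr: int, separator: str) -> List[Counter]:
    """Split one address's packets at each separator ("CLR" in X, "DONE" in Y).

    Each interval becomes a Counter of (result type, tag). The implicit zeroth
    and (N+1)-th separators are the beginning and end of the history.
    """
    result: List[Counter] = [Counter()]
    for p in history:
        if p[1] != addr:
            continue
        if p[0] == separator:
            result.append(Counter())
        elif p[0] in FET_TO_LOAD:               # command (kind, addr, tag)
            result[-1][(FET_TO_LOAD[p[0]], p[2])] += 1
        elif p[0] in FET_TO_LOAD.values():      # result (kind, addr, data, ref, tag)
            result[-1][(p[0], p[4])] += 1
    return result


def satisfies_rules_1_to_4(x: List[Tuple], y: List[Tuple]) -> bool:
    """Check rules (1)-(4) of fMM; rule (5) on reference counts is not checked."""
    for addr in {p[1] for p in x + y}:
        # Equal interval lists imply equal CLR and DONE counts (rule 1) and equal
        # per-interval, per-tag counts of each LOAD type (rules 2-4).
        if intervals(x, addr, "CLR") != intervals(y, addr, "DONE"):
            return False
    return True
```

Reordering results inside an interval is accepted, while moving a LOAD across a DONE is rejected, mirroring the nondeterminacy MM is and is not allowed.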
fMMUSER

If Y is input to the user and X is output, X ∈ fMMUSER(Y) if

(1) For all addr, either the number of CLR(addr) packets in X = the number of
    DONE(addr) packets in Y, or else there is one more CLR(addr) in X than DONE(addr)
    in Y, and there are no FET(±)(addr,--) or UPD(addr,--,--) packets after the last
    CLR(addr) in X.

(2) For all addr, for any UPD(addr,--,--) in X, the number of FET(±)(addr,--) packets
    preceding it is ≤ the number of LOAD(±)(addr,--,--,--) packets in Y.
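A companion sketch checks the two user rules on complete histories. It uses the same hypothetical tuple encoding as before (first two fields `(kind, addr)`, ASCII "-" for minus); the function name is illustrative only.

```python
from typing import List, Tuple

FETS = {"FET", "FET+", "FET-"}
LOADS = {"LOAD", "LOAD+", "LOAD-"}


def user_obeys_rules(x: List[Tuple], y: List[Tuple]) -> bool:
    """Check rules (1) and (2) of fMMUSER for every address.

    x: commands sent by the user; y: results the user has received.
    """
    for addr in {p[1] for p in x + y}:
        xs = [p[0] for p in x if p[1] == addr]
        ys = [p[0] for p in y if p[1] == addr]
        # Rule (1): at most one CLR may be outstanding, and nothing may follow it.
        extra_clrs = xs.count("CLR") - ys.count("DONE")
        if extra_clrs not in (0, 1):
            return False
        if extra_clrs == 1:
            last_clr = len(xs) - 1 - xs[::-1].index("CLR")
            if any(k in FETS or k == "UPD" for k in xs[last_clr + 1:]):
                return False
        # Rule (2): every FET(±) before an UPD must already have its LOAD(±).
        loads = sum(k in LOADS for k in ys)
        fets_so_far = 0
        for k in xs:
            if k == "UPD" and fets_so_far > loads:
                return False
            fets_so_far += k in FETS
    return True
```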

3.0.5 IMPLEMENTATION OF MM USING A RANDOM ACCESS DEVICE

Implementation of MM with a random access device is quite easy. Assume the
memory consists of two arrays, mem-data and mem-ref, containing the data and reference count
for each word, respectively. The following program will suffice:

process starts at A
input port CMDI
output port RESO
array mem-data, mem-ref
var command, addr, data, ref, tag

A: command := RCVPKT(CMDI);
   if command = FET(--,--) then   | FET: return LOAD
      let command = FET(addr, tag);
      XMTPKT(RESO) := LOAD(addr, mem-data(addr), mem-ref(addr), tag)
   else if command = FET−(--,--) then   | FET−: decrement ref and return LOAD−
      let command = FET−(addr, tag);
      mem-ref(addr) := mem-ref(addr) - 1;
      XMTPKT(RESO) := LOAD−(addr, mem-data(addr), mem-ref(addr), tag)
   else if command = FET+(--,--) then   | FET+: increment ref and return LOAD+
      let command = FET+(addr, tag);
      mem-ref(addr) := mem-ref(addr) + 1;
      XMTPKT(RESO) := LOAD+(addr, mem-data(addr), mem-ref(addr), tag)
   else if command = UPD(--,--,--) then   | UPD: update memory
      let command = UPD(addr, data, ref);
      mem-data(addr) := data;
      mem-ref(addr) := ref
   else   | CLR: return DONE
      let command = CLR(addr);
      XMTPKT(RESO) := DONE(addr);
   goto A
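For readers who want to execute the program, the following Python class is a sequential transcription of it. The tuple encoding of packets, the class name, and the zero initial reference counts are assumptions of the sketch (the text deliberately leaves initial contents unspecified). Because commands are handled one at a time, a CLR never finds pending transactions and can answer DONE immediately, just as in the program.

```python
from typing import List, Optional, Tuple


class RandomAccessMM:
    """Sequential Python model of the random-access MM program above.

    Hypothetical encoding: commands are tuples such as ("FET+", addr, tag) or
    ("UPD", addr, data, ref); results are tuples such as ("LOAD+", addr, data, ref, tag).
    """

    def __init__(self, size: int) -> None:
        self.mem_data: List[object] = [None] * size   # initial contents unspecified
        self.mem_ref: List[int] = [0] * size

    def command(self, cmd: Tuple) -> Optional[Tuple]:
        """Process one command from CMDI; return the packet sent on RESO, if any."""
        kind, addr = cmd[0], cmd[1]
        if kind in ("FET", "FET+", "FET-"):
            tag = cmd[2]
            if kind == "FET+":
                self.mem_ref[addr] += 1   # increment, then report the new count
            elif kind == "FET-":
                self.mem_ref[addr] -= 1   # decrement, then report the new count
            load = "LOAD" + kind[3:]      # FET -> LOAD, FET+ -> LOAD+, FET- -> LOAD-
            return (load, addr, self.mem_data[addr], self.mem_ref[addr], tag)
        if kind == "UPD":
            _, _, data, ref = cmd
            self.mem_data[addr] = data
            self.mem_ref[addr] = ref
            return None                   # UPD returns no result
        if kind == "CLR":
            return ("DONE", addr)         # nothing can be pending in this model
        raise ValueError(f"unknown command {kind!r}")
```

For example, after `UPD(3, "D", 1)`, a `FET+` on word 3 returns `("LOAD+", 3, "D", 2, tag)`.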
3.1 HORIZONTAL INTERCONNECTIONS OF MM SYSTEMS

The functional specifications of MM and its user have the useful properties that:

(1) fMM and fMMUSER are invariant under reordering of command packets
    referring to different words. That is, such a reordering will not affect the
    legal responses from MM, nor will it affect the legality of the commands
    from the user.

(2) fMM and fMMUSER are similarly invariant under reordering of result packets
    referring to different words.

(3) fMM and fMMUSER are invariant under reordering of LOAD(±) packets for the
    same word between any pair of DONE packets for that word, assuming the
    reference counts are suitably adjusted.

(4) the behavioral properties of MM and its user are completely independent
    for different words.

Property (4) makes it possible to connect MM systems and their users through
distributors and arbitrators, and still have an MM system. The following connections are
possible:
Multiple memory connection

[Figure: several small MM boxes joined through a distributor and an arbitrator, enclosed in a large dashed box whose own ports are CMDI and RESO.]

If each of the small boxes realizes fMM (contingent on its user realizing
fMMUSER), the large dashed box realizes fMM for a larger address space. If the user of the
large dashed box realizes fMMUSER, each small box's user realizes fMMUSER.

For this to work, the distributor and arbitrator must handle address fields
properly. If there are 2^N small MM units, the address field of the interconnection is N bits
longer than that of the units. The distributor picks out N bits of all incoming address fields
and uses them as the output port numbers. (For interleaving purposes, it might be most
effective to pick out the least significant bits.) Those bits do not appear in the address fields
of the packets that are sent to the MM units. The arbitrator inserts the input port number of
each incoming packet into the address field in the same positions as the bits that were
removed by the distributor.
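A minimal Python sketch of this address mapping, assuming the least significant N bits are used (the interleaving choice suggested above) and that ports are numbered from 0. Both function names are hypothetical. The round trip through distributor and arbitrator must give back the original address, which is what lets the whole interconnection behave as a single MM.

```python
from typing import Tuple


def distributor_split(addr: int, n: int) -> Tuple[int, int]:
    """Take the n least significant address bits as the MM unit (output port) number.

    Returns (port, unit_addr); the port bits are removed from the address that
    is forwarded to the unit.
    """
    port = addr & ((1 << n) - 1)
    unit_addr = addr >> n
    return port, unit_addr


def arbitrator_merge(port: int, unit_addr: int, n: int) -> int:
    """Reinsert the input port number into the bit positions the distributor removed."""
    return (unit_addr << n) | port
```

With n = 3 (8 units), address 0b10110110 goes to unit 0b110 as local address 0b10110, and the arbitrator reassembles 0b10110110 on the way back.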


This connection is one of the methods by which the transaction rate can be
increased. Random access memory devices have the property that every read or write
transaction causes the device to become busy for some period of time, during which it cannot
handle any other transactions. For example, a MOS RAM might be busy for 500 nanoseconds
during every transaction, and therefore be able to handle 2 million transactions per second.
Putting a FIFO buffer on it will increase its latency, but its transaction rate stays the same.
The only way to increase the data rate is to use many memory units. If a distributor can
handle 64 million packets per second on its input port, and an arbitrator can handle 64 million
packets per second on its output port, it might be reasonable to use 32 MOS RAMs, each in a
separate MM unit. These are connected to a 32-port distributor and a 32-port arbitrator. The
average rate at which packets come out of each port of the distributor is 2 million per second,
which is the rate at which individual units can handle them. Assuming the commands are
uniformly distributed over the address space, this interconnection will handle 64 million
transactions per second. The retrieval delay for each item will still be 500 nanoseconds, but
that is an unavoidable consequence of the memory technology used.
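The arithmetic behind these figures can be checked in a few lines of Python. The numbers are the ones in the example; the aggregate figure is an upper bound that assumes uniformly distributed commands, as the text does.

```python
# Figures from the example above: 500 ns busy time per transaction, 32 MM units,
# commands assumed uniformly distributed over the address space.
busy_time_s = 500e-9
units = 32

per_unit_rate = 1 / busy_time_s          # transactions per second for one RAM
aggregate_rate = units * per_unit_rate   # upper bound for the whole interconnection

print(f"{per_unit_rate / 1e6:.0f} million/s per unit")    # 2 million/s per unit
print(f"{aggregate_rate / 1e6:.0f} million/s aggregate")  # 64 million/s aggregate
```

The aggregate rate exactly matches the 64 million packets per second that the distributor and arbitrator are assumed to sustain, which is why 32 units is the natural choice here.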

For  this  interconnection  to  work  effectively,  the  latency  of  the  individual  MM 
units,  or  the  output  latency  of  the  distributor,  must  be  at  least  one,  and  preferably  more.  If 
the  MM  units  and  the  distributor  all  have  latency  zero,  the  distributor  will  be  unable  to 
acknowledge  a command,  and  hence  unable  to  get  the  next  one,  until  the  command  has  been 
completely  processed  by  the  MM  unit.  This  would  defeat  the  purpose  of  using  multiple  units. 
In  practice,  the  latency  might  be  somewhat  more  than  one,  in  order  to  maintain  a transaction 
rate  near  the  maximum  in  the  presence  of  nonuniform  statistical  frequency  of  commands  for 
each  unit.  This  can  be  accomplished  by  placing  a small  FIFO  buffer  between  the  distributor 
and  each  MM  unit. 
Multiple user connection

This is just like the multiple memory connection, but with the roles of MM and the user
exchanged. If the solid box realizes fMM, each of the interfaces at the top of the diagram
realizes fMM for a smaller address space. If each of the users of this interconnection realizes
fMMUSER, the collection of all of them along with the arbitrator and distributor realizes
fMMUSER on the large address space.

As in the previous case, the arbitrator must map the input port number into a
larger address field, and the distributor must remove the corresponding part of the address
field and use it as the output port number. Each of the interfaces at the top of the diagram
realizes an equivalent address space, and each uses a different subset of the memory space
contained in the actual MM unit.

This connection would be used if there were several users, each presenting
commands at such a slow rate that one memory module could handle all of them. Such a
situation could arise if several cache modules are used which have a sufficiently high "hit"
rate that the rate of memory requests arising from cache misses is low.
3.2 VERTICAL COMPOSITION AND THE CACHE MODULE

In this section we describe the cache module "CM", which connects to an MM
system and, so connected, realizes an MM system with the same address space.


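[Figure: vertical composition. CM presents ports CMDI and RESO to the user and connects through its ports MEMO and MEMI to a small box labelled MM; CM and the small MM together form a large dashed box that is itself an MM.]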


If the small box labelled MM realizes fMM, the large dashed box realizes fMM.
If the user of the large dashed box realizes fMMUSER, the user of the small box realizes
fMMUSER.


Vertical  and  horizontal  interconnections  may  be  mixed  as  in  the  following 
examples,  since  in  each  case  the  system  being  implemented  is  MM. 
The purpose of a cache is to retain the data of a small subset of the main
memory's address space, and to answer requests for data in that subset directly without reading
it from main memory. Since the cache has much less data than the main memory, it can be
built out of faster circuits and devices without being prohibitively expensive. Hence any
request for a datum that is in the cache (a "cache hit") is answered very quickly. If the cache
is sufficiently well designed that it has a high hit rate, the overall performance of the memory
will be nearly as good as that of the cache itself.

A cache must be designed to maximize the hit rate by holding those memory
items that are likely to be addressed. This is usually done by assuming that the addresses
being used vary slowly with time, and so, when an item is referred to once, it is likely to be
referred to again soon, and should be placed in the cache. Therefore, when an item is
addressed which is not in the cache (a "cache miss"), the datum is fetched from main memory,
placed in the cache, and also returned to the user. Subsequent requests for that datum will
be cache hits.


The size of the "items" that the cache contains affects its performance. A cache
for the main memory of a conventional computer may use rather large items consisting of, for
example, 8 consecutive words. This is effective because references to memory, especially
instruction fetches, tend to be localized in space. When a cache miss occurs on any word, a
block of 8 consecutive words is read from main memory and loaded into the cache. Since
references in the immediate future are likely to be in this block, the hit rate is increased.
The structure memory for a data flow computer has no such locality of
reference. Therefore, the unit of cache organization will be the individual word.

Placing an item in the cache usually requires removing some other item. The
most popular strategy, and the one that will be used here, is the "least recently used" (LRU)
strategy. Each reference to a cache item is noted in some sort of reference table. When
space must be made in the cache for a new datum, the item that has been used least recently,
that is, has gone the longest time without a reference, is chosen.

When a write command is issued, the item in the cache is updated
appropriately. In some cache organizations, the item in main memory is always updated also.
This technique, known as "write through", will not be used here. Instead, the item in the
cache will simply be marked as having been modified. When an item that has been modified
must be displaced from the cache, it is first written into main memory. This method has a
lower volume of commands going from the cache to main memory than the "write through"
method.


It is crucial that the cache be able to determine very quickly whether or not it
contains a given word. Since its memory space is much smaller than the full address space, it
must store the full address with each item. When a command is received, the cache must be
searched for an item with the given address, and the search must be fast.


A popular method of organizing the cache for rapid searching is the "set
associative" memory [12]. The cache is organized as an array of columns and rows. The full
address space is similarly organized, with the same number of columns, and a presumably
much greater number of rows. Each item in the cache is constrained to correspond to the
same column in the full address space as its own column in the cache. Therefore, to search
for a given item whose full address is known, the address is separated into row and column.
If the item is in the cache, it must be in the same column as its column address in the real
memory, so only that column of the cache needs to be searched. Furthermore, only row
addresses need to be stored in the cache along with the items. The column addresses are
implicit from the position in the cache.


This organization works well for a surprisingly small number of rows in the
cache. For example, the main memory cache on the IBM 370/168 computer has only four
rows. (The number of rows is referred to as "cache depth".) To determine whether a given
item is in the cache, only four address comparisons need to be made. These can easily be
done simultaneously.

The  column  number  of  a word  in  the  full  address  space  is  typically  taken  from 
the  low  bits  of  its  address.  The  row  number  comes  from  the  remaining  bits.  This  allows 
consecutively  addressed  items  to  reside  in  the  cache  in  adjacent  columns  of  one  row. 


Example: Suppose the full address space contains 4096 addresses, and
addresses consist of four octal digits. There are 8 columns, and the low digit of the address
is the column number. The cache depth is three.

column number    0    1    2    3    4    5    6    7

row address    551  550  543  504  444  425  425  425
data             A    B    C    D    E    F    G    H

row address    412  417  447  313  314  315  270  241
data             I    J    K    L    M    N    O    P

row address    242  242  242  242  246  271  365  413
data             Q    R    S    T    U    V    W    X


The cache holds the item with address 4472, with data "K". When a command is
received requesting the contents of location 4472, the address is divided into the row (447)
and the column (2). Column 2 of the cache is then searched for 447. It contains 543, 447,
and 242. 447 is compared with these three numbers simultaneously. It matches the second of
them, so the data associated with it (K) is returned to the user.
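The lookup just described can be sketched in Python using the example cache. The list-of-rows layout, the `COLUMN_BITS` constant and the `lookup` function are assumptions of the sketch; a real cache would compare all rows of the column in parallel rather than in a loop.

```python
from typing import List, Optional

# Cache contents from the example: CACHE[row][column] = (row address, data).
# Row addresses are written in octal, as in the text.
CACHE = [
    [(0o551, "A"), (0o550, "B"), (0o543, "C"), (0o504, "D"),
     (0o444, "E"), (0o425, "F"), (0o425, "G"), (0o425, "H")],
    [(0o412, "I"), (0o417, "J"), (0o447, "K"), (0o313, "L"),
     (0o314, "M"), (0o315, "N"), (0o270, "O"), (0o241, "P")],
    [(0o242, "Q"), (0o242, "R"), (0o242, "S"), (0o242, "T"),
     (0o246, "U"), (0o271, "V"), (0o365, "W"), (0o413, "X")],
]
COLUMN_BITS = 3  # 8 columns: the low octal digit of the address


def lookup(cache: List[list], addr: int) -> Optional[str]:
    """Return the cached data for addr, or None on a cache miss."""
    column = addr & ((1 << COLUMN_BITS) - 1)
    row_addr = addr >> COLUMN_BITS
    # Only one column is examined; hardware compares its rows simultaneously.
    for stored_row, data in (row[column] for row in cache):
        if stored_row == row_addr:
            return data
    return None


print(lookup(CACHE, 0o4472))  # K
```

Address 2124 (octal) maps to column 4, whose rows hold 444, 314 and 246, so that lookup misses.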

When  a new  item  is  to  be  put  into  the  cache,  its  column  number  is  known  in 
advance,  so  only  its  row  must  be  determined  by  searching  the  column  for  the  least  recently 
used  item.  For  example,  if  an  entry  for  2124  must  be  created,  column  4 is  searched.  If  the 
least  recently  used  item  is  314,  it  is  removed.  If  its  "modify"  bit  is  on,  an  UPD  packet  is  sent 
to  main  memory,  containing  the  address  (3144)  and  the  data  (M).  The  row  address  is  then 
changed  to  212. 

The  determination  of  which  item  in  a column  was  least  recently  used  can  be 
made  by  some  simple  scheme  such  as  keeping  a counter  along  with  the  data  for  each  item. 
Whenever  any  reference  is  made,  that  item’s  counter  is  set  to  zero  and  all  others  in  its  column 
are  increased  by  one.  The  least  recently  used  item  is  the  one  with  the  highest  count. 
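The counter scheme, together with the replacement and write-back steps of the 2124/3144 example, might be sketched in Python as follows. The `Entry` record, the `touch` and `replace` functions, and the inclusion of the reference count in the written-back UPD packet (matching the UPD(addr, data, ref) format) are assumptions of this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    row_addr: int
    data: object
    ref: int = 0
    modified: bool = False   # the "modify" flag
    age: int = 0             # LRU counter: 0 = most recently used


def touch(column: List[Entry], used: Entry) -> None:
    """On a reference: zero the used entry's counter and add one to the others."""
    for e in column:
        e.age = 0 if e is used else e.age + 1


def replace(column: List[Entry], col_num: int, col_bits: int,
            new_row: int) -> Optional[Tuple[str, int, object, int]]:
    """Reuse the least recently used entry (highest counter) for new_row.

    Returns the UPD packet owed to main memory if the victim was modified
    (write-back rather than write-through), otherwise None.
    """
    victim = max(column, key=lambda e: e.age)
    upd = None
    if victim.modified:
        full_addr = (victim.row_addr << col_bits) | col_num
        upd = ("UPD", full_addr, victim.data, victim.ref)
    victim.row_addr, victim.data, victim.ref, victim.modified = new_row, None, 0, False
    touch(column, victim)
    return upd
```

If the modified entry for row 314 in column 4 is the oldest, making room for 2124 returns an UPD for address 3144 carrying M, and that entry's row address becomes 212.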

Because  each  operation  in  the  cache  involves  examination  of  an  entire  column, 
the  cache  memory  itself  should  be  organized  so  that  each  column  is  a "word",  that  is,  the 
entire  column  is  read  or  written  at  once. 

3.2.1  DESIGN  OF  CM 

The functional specification of CM is very simple: it must realize fMM through
its "top" ports and realize fMMUSER through its "bottom" ports.
An implementation of a system realizing fCM will now be given. Each word of
the full address space is in one of eight states, denoted N, P, P', Q, Q', R, R', and T.

N  - The word is not in the cache at all. (Since the cache is much smaller than
     the full address space, most words are in this state at any instant.) There
     are no pending commands from the user to the system. There are no
     pending commands from the cache to the main memory.

P  - Space has been reserved in the cache for the word, and at least one FET(±)
     has been sent to main memory, but no LOAD(±) has come back. One or more
     FET(±)/LOAD(±) transactions are pending to the cache. Exactly the same
     transactions are pending to the main memory.

P' - Same as P, but a CLR packet has been received from the user. One or
     more FET(±)/LOAD(±) transactions, plus a CLR, are pending to the cache.
     The same transactions without the CLR are pending to the main memory.

Q  - The first LOAD(±) has come back from main memory. A CLR packet will be
     sent as soon as main memory is able to accept it. Zero or more
     FET(±)/LOAD(±) transactions are pending to the cache. Exactly the same
     transactions are pending to the main memory.

Q' - Same as Q, but a CLR packet has been received from the user. Zero or
     more FET(±)/LOAD(±) transactions, plus a CLR, are pending to the cache.
     The same transactions without the CLR are pending to the main memory.

R  - The word is in the cache, but some FET(±)/LOAD(±) transactions may still be
     in progress in main memory. A CLR packet has been sent to remove them.
     No CLR packet has been received from the user. Zero or more
     FET(±)/LOAD(±) transactions are pending to the cache. The same
     transactions, plus a CLR, are pending to the main memory.

R' - Same as R, but a CLR packet has been received from the user. Zero or
     more FET(±)/LOAD(±) transactions, plus a CLR, are pending to the cache.
     Exactly the same transactions are pending to the main memory.

T  - The word is truly in the cache. There are no pending transactions to the
     cache or from the cache to the main memory.

The normal states for a word are N or T, depending on whether the word is in
the cache or not. In state T, all commands are acted upon immediately by the cache without
any communication with main memory. In state N, any command from the user causes the
word to undergo transitions that eventually result in its being in state T. If the command is a
FET(±), the word must be read from main memory, and the state goes through some of the
intermediate states. If the command is UPD, the word is created in the cache in state T. In
either case, some other word may have to be displaced, going from state T to state N. If the
"modify" flag for that word is on, an UPD packet is sent to main memory.

The specifications of MM and its user require that the user accept all result
packets, but MM may refuse to accept further commands until the results of earlier
commands have been accepted by the user (although an efficient implementation of MM might
allow many commands to be in progress at once). Therefore, in order to avoid a deadlock, CM
must accept packets from main memory at MEMI, even if main memory refuses to accept
any further commands through MEMO. CM sometimes must wait for memory to accept a
command. While it is waiting, it may refuse to accept further commands at CMDI, but it must
always be willing to accept packets at MEMI. CM may assume that any packet sent through
RESO will be accepted.

The reason why CM allocates a cache cell for an item and puts it into state P as
soon as the first FET(±) command comes from the user is to avoid a deadlock, that is, a
situation from which the system cannot proceed. If it simply sent the packet out through
MEMO and did not allocate the cache cell until the first LOAD(±) packet came back, it would
use its own space more efficiently, but would be in danger of deadlock. (P cells are useless,
since they do not contain data.) This will be explained in section 6.0.

In the following description of the cache algorithm, the manipulation of the
counters to determine the least recently used item is not shown.

STATE  N 

FET(±)(addr, tag) at CMDI - Create space in the appropriate cache column.

Either use an empty space (this situation can only arise when the system is
first started) or remove the least recently used item in state T. If no item
is in state T, wait until one enters state T, not accepting any packets on
CMDI while waiting. (Items in other states will progress to state T.) When
the item to be removed is found, write it out if its "modify" flag is on, by
sending an UPD packet at MEMO. If main memory is not accepting packets
at MEMO, wait until it does. Then create a new item in the cache with the
given address, "modify" = 0, state = P. Leave the data and reference count
fields unspecified. Also, send a FET(±) packet, identical to the incoming one,
out through MEMO, to fetch the data.

CLR(addr) at CMDI - send DONE(addr) at RESO.

UPD(addr, data, ref) at CMDI - Create space in the cache as for FET(±), perhaps
sending an UPD packet to memory. Then create a new item in the cache
with the given address, "modify" = 1, data and reference count from the
command, and state = T.

LOAD(±)  or  DONE  at  MEMI  - can’t  occur  because  no  transactions  are  pending  in 
main  memory. 

STATE  P 

FET(±)(addr, tag) at CMDI - Send the same packet at MEMO.

CLR(addr) at CMDI - Change to state P’.

UPD(addr, data, ref) at CMDI - can't happen, since transactions are pending in
the cache.

LOAD(±)(addr, data, ref, tag) at MEMI - Deposit the data and reference count
into the cache word, and send the same packet out at RESO. If the main
memory is accepting commands, send a CLR(addr) at MEMO and change this
cache item to state R. If not, change to state Q.

DONE  at  MEMI  - can’t  happen,  since  no  CLR  has  been  given  to  main  memory. 

STATE  P’ 

FET(±), UPD, or CLR at CMDI - can't happen, since user has a CLR/DONE
transaction pending.

LOAD(±)(addr, data, ref, tag) at MEMI - Deposit the data and reference count
into the cache word, and send the same packet out at RESO. If the main
memory is accepting commands, send a CLR(addr) at MEMO and change this
cache item to state R’. If not, change to state Q’.

DONE at MEMI - can't happen, since no CLR has been given to main memory.

STATE Q

Note: CM does not accept any command at CMDI whenever any item is in state
Q. Q is simply a temporary state that is waiting to send a CLR(addr) out through
MEMO and go into state R.

FET(±), UPD, or CLR at CMDI - can't happen, since cache is not accepting
commands.

LOAD(±) at MEMI - same as state R.

DONE  at  MEMI  - can’t  happen,  since  CLR  has  not  been  sent  to  main  memory. 

Main  memory  becomes  able  to  accept  a command  - Send  CLR(addr)  through 
MEMO,  change  to  state  R. 


STATE Q’

Note: CM does not accept any command at CMDI whenever any item is in state
Q’. Q’ is simply a temporary state that is waiting to send a CLR(addr) out
through MEMO and go into state R’.

FET(±), UPD, or CLR at CMDI - can't happen, since cache is not accepting
commands.

LOAD(±) at MEMI - same as state R.

DONE  at  MEMI  - can’t  happen,  since  CLR  has  not  been  sent  to  main  memory. 


Main memory becomes able to accept a command - Send CLR(addr) through
MEMO, change to state R’.

STATE R

FET(±)(addr, tag) at CMDI - Update the reference count in the cache, and set
the "modify" bit if the packet was FET- or FET+. Send LOAD(±)(addr, data,
newref, tag) through RESO, where data and newref are current contents of
the cache. Note: at the instant this happens, there may still be
FET(±)/LOAD(±) transactions pending in main memory. If so, those FET(±)
packets were earlier than this one, but the corresponding LOAD(±) packets
won't be returned until later. This is the circumstance which causes the
general system MM to occasionally return LOAD(±) packets in an order
different from that of the FET(±) packets.

UPD(addr, data, ref) at CMDI - Update the cache, set the "modify" bit. Note: if
an UPD packet is received while in state R, we know from the rules for
MMUSER that no FET(±)/LOAD(±) transactions are pending in main memory.

CLR(addr)  at  CMDI  - Change  to  state  R’. 

LOAD(±)(addr, data, ref, tag) at MEMI - Ignore the "ref" field in the packet.
Increment or decrement the reference count in the cache if the packet is
LOAD- or LOAD+. Do not set the "modify" flag, since main memory already
knows about the reference count change. Send LOAD(±)(addr, data, newref,
tag) through RESO, where newref = the updated reference count in the
cache.


DONE(addr) at MEMI - Change to state T.


STATE R’

FET(±), UPD, or CLR at CMDI - can't happen, since user has a CLR/DONE
transaction pending.

LOAD(±) at MEMI - same as state R.

DONE(addr) at MEMI - send DONE(addr) through RESO, change to state T.

STATE T 

FET(±)(addr, tag) at CMDI - Update the reference count in the cache, and set
the "modify" bit if the packet was FET- or FET+. Send LOAD(±)(addr, data,
newref, tag) through RESO, where data and newref are current contents of the
cache.

UPD(addr, data, ref) at CMDI - Update the cache, set the "modify" bit.

CLR(addr) at CMDI - Send DONE(addr) through RESO.

LOAD(±) or DONE at MEMI - can't happen, since there are no pending
transactions in main memory.
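The transitions above can be collected into one table. The Python sketch below is an illustrative summary, not the author's algorithm: it tracks only the next state of a single cache cell, leaves out packet emission and the least recently used counters, and uses hypothetical event names (`FET`, `UPD`, `CLR` for commands at CMDI; `LOAD`, `DONE` for packets at MEMI; `MEM_READY` for main memory becoming able to accept a command; `EVICT` for displacement of a cell in state T).

```python
from typing import Dict, Iterable, Tuple

# (state, event) -> next state, transcribed from the state descriptions.
# Event names are hypothetical labels; "P'", "Q'", "R'" are the primed states.
_TABLE: Dict[Tuple[str, str], str] = {
    ("N", "FET"): "P", ("N", "UPD"): "T", ("N", "CLR"): "N",
    ("P", "FET"): "P", ("P", "CLR"): "P'",
    ("Q", "LOAD"): "Q", ("Q", "MEM_READY"): "R",
    ("Q'", "LOAD"): "Q'", ("Q'", "MEM_READY"): "R'",
    ("R", "FET"): "R", ("R", "UPD"): "R", ("R", "CLR"): "R'",
    ("R", "LOAD"): "R", ("R", "DONE"): "T",
    ("R'", "LOAD"): "R'", ("R'", "DONE"): "T",
    ("T", "FET"): "T", ("T", "UPD"): "T", ("T", "CLR"): "T",
    ("T", "EVICT"): "N",
}


def is_legal(state: str, event: str) -> bool:
    """False for the combinations the text marks as "can't happen"."""
    return (state, event) in _TABLE or (state in ("P", "P'") and event == "LOAD")


def next_state(state: str, event: str, mem_accepting: bool = True) -> str:
    # The first LOAD in P or P' goes to R/R' if a CLR can be sent to main
    # memory at once, otherwise to the waiting state Q/Q'.
    if state in ("P", "P'") and event == "LOAD":
        prime = "'" if state == "P'" else ""
        return ("R" if mem_accepting else "Q") + prime
    if (state, event) not in _TABLE:
        raise ValueError(f"{event} cannot occur in state {state}")
    return _TABLE[(state, event)]


def run(state: str, events: Iterable[str], mem_accepting: bool = True) -> str:
    """Apply a sequence of events to one cell and return its final state."""
    for event in events:
        state = next_state(state, event, mem_accepting)
    return state
```

For example, `run("N", ["FET", "LOAD", "DONE"])` returns `"T"`; a combination the text marks as impossible raises `ValueError`.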

3.2.2  PROOF  OF  CORRECTNESS  OF  CM 

A proof of CM's correctness is generally similar to that of the system MEM
given in section 2.0.3. The memory state required in the specification is the contents of the
last UPD packet in the input history. One must show that, for a cell in states Q, Q’, R, R’, or T,
the data in the cache itself is the same as that in the last UPD packet at CMDI, and, if the
modify bit is off, this data is in main memory also. For states N, P, and P’, the correct data is
in main memory, that is, the last UPD at CMDI has the same data as the last UPD at MEMO.
These properties must be shown to be preserved for all state transitions, and it must be
shown that all legal FET(±) commands will get the correct data. Furthermore, the effect of
reference count modifications resulting from FET+ and FET- commands must be taken into
account.




4.0 IMPLEMENTATION OF MM USING A "ROTATING" DEVICE

"Rotating" memories such as charge coupled device (CCD) or "magnetic bubble"
shift registers, or magnetic disks, are rightly considered to be essentially unusable for the
main memory of a computer because of their excessive retrieval delay. In a data flow
computer, total transaction rate is as important a criterion as retrieval delay, and so the
disadvantages of these devices largely disappear, making them perhaps economical as a mass
store. On the other hand, further improvements in RAM technology may render these shift
registers obsolete for most applications. This section is predicated on the assumption that
CCDs or bubble memories will be economical and useful in the packet memory system.

In a rotating memory, the data is structured in a ring which "rotates" past a
"read/write head". Equivalently, one may think of it as a fixed ring and a pointer rotating
around the ring, with memory transactions permitted only on the cell currently pointed to. If
the addresses of words correspond to fixed places on the ring, it is possible to predict when
any given cell will be pointed to. Commands from the user can be stored in a memory
somewhat like a queue, sorted by position, so that the pending transaction at the head of the
queue is always (or nearly always) the one that the pointer will reach next. This will make
optimal use of the availability of data from the CCD.
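As an illustration, if the ring has `ring_size` cells and the head is over cell `pointer`, a request for cell `addr` can be served after `(addr - pointer) mod ring_size` shifts. A minimal Python sketch, with hypothetical function names, orders pending requests by that distance, which is the order in which the pointer reaches them:

```python
from typing import List


def wait_shifts(pointer: int, addr: int, ring_size: int) -> int:
    """Number of shifts until `addr` comes under the read/write head."""
    return (addr - pointer) % ring_size


def service_order(pointer: int, pending: List[int], ring_size: int) -> List[int]:
    """Pending cell addresses in the order the rotating pointer reaches them."""
    # sorted() is stable, so requests for the same cell keep their arrival order.
    return sorted(pending, key=lambda addr: wait_shifts(pointer, addr, ring_size))
```

With a 16-cell ring and the pointer at 10, `service_order(10, [3, 12, 10, 15], 16)` gives `[10, 12, 15, 3]`. Because `sorted` is stable, two requests for the same cell stay in the order they arrived.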

There are a number of CCD architectures currently in use. In the "line
addressed random access memory" (LARAM), only a small part of the device shifts at full
speed at any one time. The rest shifts and recirculates at a much lower speed in order to
conserve power. The intent is to make the device behave somewhat like a random access
memory. To retrieve any one item, one finds the section in which that item is stored, and
directs the CCD to shift that section at high speed until the desired item is found. While this
is happening, the other sections are shifting much more slowly, so this architecture is not
efficient when many items are being sought at one time. It is therefore not suitable for the
type of packet memory system being considered here.

Two other types of CCDs are the "serpentine", which is simply a long shift
register (it "snakes" back and forth on the IC chip), and the "serial-parallel-serial", which is
simply a collection of interleaved shift registers. These two types differ only in engineering
specifications such as data rate and power consumption. They both behave like long shift
registers, and hence are suitable for the type of memory under discussion.

There are a number of implementation considerations that must be taken into
account in designing a rotating packet memory. For example, a number of shift registers, one
for each bit of a data word, may be used, so that a new data word comes into position on
each clock pulse. On the other hand, a single shift register might be used, with each word
stored serially, or any arrangement between these two extremes can be used. One might also
use an unusual correspondence between address and shift register position. All of these
considerations are irrelevant to the structure being considered, so we will assume the memory
is a ring of full words, ordered by address, with address zero following the highest address,
and the pointer scanning the ring in order of increasing address. Any other implementation is
equivalent to this.

In the following, the memory will be referred to as the "CCD", regardless of
what type of device it actually is.

Pending transactions (that is, packets received at CMDI) are stored in the
transaction list (TL), which is presumably much smaller than the memory itself. The TL is
presumably realized with a random access memory device. In order to avoid moving data in
the TL unnecessarily, it has a ring structure just like the memory. Transactions are placed in
the TL at or near the same angular position as the position in memory of the word to which
they refer. Since the TL is a smaller ring than the memory, each address of TL corresponds
to many consecutive addresses of memory.

Let <X> be the function mapping addresses in the entire address space into
the corresponding address in the TL. This is called the hash function for reasons that will be
explained later. <X> is just the integer part of the quotient of X divided by the ratio of
memory size to TL size. In a realization in which all sizes are powers of two, <X> is just the
appropriate number of high order bits of X.
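A minimal Python sketch of <X>, assuming the memory size is an exact multiple of the TL size. The second function covers the power-of-two case; `mem_bits` and `tl_bits` are hypothetical names for the widths of a CCD address and a TL address.

```python
def tl_hash(x: int, mem_size: int, tl_size: int) -> int:
    """<X>: integer part of X / (memory size / TL size).

    Assumes mem_size is an exact multiple of tl_size."""
    return x // (mem_size // tl_size)


def tl_hash_pow2(x: int, mem_bits: int, tl_bits: int) -> int:
    """Same mapping when both sizes are powers of two: keep the top tl_bits bits."""
    return x >> (mem_bits - tl_bits)
```

With 64 CCD words (two octal digits) and an 8-cell TL, `tl_hash(0o16, 64, 8)` is 1, the first octal digit of the address.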




When a command is received for address X, the command packet is placed in
the TL at address <X>, or the first free address thereafter if <X> is full. Assuming a
uniform distribution of addresses appearing in commands, the TL should be uniformly filled.
As the memory pointer rotates through the memory, another pointer, maintaining about the
same angular position, rotates through the TL, picking out the next transaction to perform.

The TL is organized much like the "ordered hash table" devised by Amble and
Knuth [2], with modifications to allow for its circularity and for the fact that items are being
removed from it. In an ordered hash table, each item has a hash address. It is placed in the
table at its hash address or in the contiguous block of items after the hash address. This
block is in increasing order of data value. This ordering makes it possible to determine
whether an item is in the table much more quickly than in a conventional hash table.

Although ordered hash tables are intended for entirely different applications
than the transaction list of a packet memory, the concept is well suited to this application.
The "value" of an item in the table is the word address appearing in the packet. Let a(P)
denote this address for packet P, and call it the "CCD address". The "hash address"
corresponding to CCD address X is just <X>, defined earlier. (Hash functions are usually
designed to be random, but that property is not desirable here.) The hash address of packet
P is therefore <a(P)>.

Because the TL is a ring instead of a linear list, a different definition of order is
needed. The concepts of "greater than" and "less than" are replaced by "clockwise from" and
"counterclockwise from". Since any item is both clockwise and counterclockwise from any
other item, the order of two items must be defined relative to a third. This is done through
the use of intervals denoted in ordinary mathematical notation. [X, Y] is the interval from X
clockwise to Y. If X < Y, it has its customary meaning. If X > Y, [X, Y] is the set of numbers
from X up to the highest address, and from zero to Y. Open and half-open
intervals have their customary meaning, that is, [X, Y) means [X, Y] exclusive of Y, etc. [X, Y)
and [Y, X) are clearly complements of each other if X ≠ Y.


The ordering of hash addresses and word addresses is expressed in terms of
whether or not an element is in an interval. Z ∈ [X, Y) means that if one starts at X and
moves clockwise, one reaches Z before Y.
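The clockwise intervals can be tested with modular arithmetic: Z lies in [X, Y] exactly when its clockwise distance from X does not exceed that of Y. A sketch in Python (the function name and keyword flags are illustrative):

```python
def in_interval(z: int, x: int, y: int, n: int,
                left_closed: bool = True, right_closed: bool = True) -> bool:
    """Is z in the clockwise interval from x to y on a ring of n addresses?"""
    offset = (z - x) % n          # how far clockwise z lies from x
    span = (y - x) % n            # how far clockwise y lies from x
    if offset == 0 and not left_closed:
        return False              # open on the left: z == x is excluded
    return offset <= span if right_closed else offset < span
```

On an 8-cell ring, [6, 2] is {6, 7, 0, 1, 2}, so `in_interval(1, 6, 2, 8)` is true and `in_interval(4, 6, 2, 8)` is false. For any X ≠ Y, the half-open tests for [X, Y) and [Y, X) always disagree, as the complement property requires.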

The general rule for maintaining order in TL is that, if one goes clockwise from
an item's hash address to the item itself, one will not pass any empty cells and will pass only
"smaller" items, that is, items whose hash addresses are counterclockwise from this one. This
is best illustrated with a diagram. Let CCD addresses be two octal digits and hash addresses
be one digit. The hash function picks out the first digit. The transaction list has 8 cells and is
drawn as a circle.


Cells 0 and 6 are empty. Cell 2 contains a packet with address 16, whose hash
address is 1 but was displaced because cell 1 is full.

It is possible for the transaction list to contain several packets referring to the
same CCD address. Specifically, the following configurations are possible:

One or more FET(±) packets. When the CCD pointer reaches the appropriate
address, its data will be read and sent back to the user in a sequence of
LOAD(±) packets.

One or more FET(±) packets, followed by a CLR. When the CCD pointer reaches
the appropriate address, the LOAD(±) packets will be sent out, followed by
a DONE packet.

A single UPD packet. The data will be written into the CCD when the
appropriate address is reached.

No other states are possible. This is because it is a violation of MMUSER to send
an UPD packet when there are FET(±) or CLR packets pending. If an UPD is given when an
UPD is already pending, the new one simply replaces the old one. If a FET(±) is given when
an UPD is pending, the data is taken directly from the pending UPD packet and returned in a
LOAD(±) packet.
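These rules determine what happens when a new command meets packets already pending for the same address. The Python sketch below illustrates them under stated assumptions. `Packet` and `add_command` are hypothetical names. A FET(±) that meets a pending UPD is assumed to adjust the reference count carried in that UPD, which is consistent with the later remark that the pending UPD may be modified. A CLR with nothing pending is assumed to be answered with DONE at once. The case of a CLR meeting a pending UPD, which this section does not describe, is simply rejected.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DELTA = {"FET": 0, "FET+": 1, "FET-": -1}


@dataclass
class Packet:
    kind: str            # "FET", "FET+", "FET-", "CLR" or "UPD"
    addr: int            # CCD address a(P)
    data: object = None  # UPD only
    ref: int = 0         # UPD only
    tag: object = None   # FET only


def add_command(pending: List[Packet], cmd: Packet) -> Optional[Tuple]:
    """Merge `cmd` into the packets pending for its address.

    Returns an immediate reply for RESO, or None if the command is queued."""
    upd = pending[0] if pending and pending[0].kind == "UPD" else None
    if cmd.kind == "UPD":
        if pending and upd is None:
            raise ValueError("UPD with FET/CLR pending violates MMUSER")
        pending[:] = [cmd]                  # a new UPD replaces any old one
        return None
    if cmd.kind in DELTA:
        if upd is not None:
            # Assumption: answer from the pending UPD and adjust its count.
            upd.ref += DELTA[cmd.kind]
            return ("LOAD" + cmd.kind[3:], cmd.addr, upd.data, upd.ref, cmd.tag)
        if pending and pending[-1].kind == "CLR":
            raise ValueError("command issued while a CLR/DONE is pending")
        pending.append(cmd)                 # FET(±) ... [CLR] configuration
        return None
    if cmd.kind == "CLR":
        if upd is not None:
            raise NotImplementedError("CLR meeting a pending UPD is not described")
        if not pending:
            return ("DONE", cmd.addr)       # assumption: nothing to wait for
        pending.append(cmd)
        return None
    raise ValueError(f"unknown command {cmd.kind}")
```

A FET+ arriving while `UPD(5, "x", 1)` is pending is answered at once with `("LOAD+", 5, "x", 2, tag)`.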

Intuitively, the rule for a well-formed transaction list is that the lines
progressing clockwise from a cell to those items with that cell's hash address must never
cross each other or pass over an empty cell. If an item with CCD address 43 were placed
into cell 6, this rule would be violated, since the line from 4 to 43 would cross the line from 5
to 55. The insertion algorithm must instead put the 43 into cell 5 and move the 55 to cell 6.
Furthermore, all items with the same hash address must be ordered by CCD address. In the
example, 16 is clockwise from 11.

To insert an item, start at its hash address and search clockwise until an empty
cell or a cell containing an item with higher (more clockwise) CCD address is found. In the
former case, insert the new item. In the latter case, insert the new item after making space
for it by pushing the old item, and all those contiguously following it, one space clockwise. In
the example, insertion of item 10 would require pushing 11, 16, 25, 32, and 55 clockwise.
Insertion of 42 would require pushing only the 55.
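The insertion rule can be sketched in Python for the running example (8 TL cells, CCD addresses of two octal digits, so <X> is X // 8). This is an illustration, not the algorithm of appendix III A. It ignores the removal region and UPD merging, and it stores only CCD addresses in the cells. It decides "higher (more clockwise)" as follows: an item whose hash address lies clockwise of the new item's hash address, up to the item's own cell, counts as larger, and items with equal hash addresses are compared by CCD address. `insert`, `is_larger` and the constants are hypothetical names.

```python
from typing import List, Optional

N_CELLS = 8   # TL size in the running example
RATIO = 8     # CCD words per TL cell: two octal digits -> one


def hash_addr(x: int) -> int:
    return x // RATIO                                   # <X>


def is_larger(item: int, cell: int, new: int) -> bool:
    """Does `item`, stored at `cell`, belong clockwise of `new`?"""
    h, hi = hash_addr(new), hash_addr(item)
    if hi == h:
        # Same hash address: order by CCD address; an equal address goes
        # after the existing item, so same-address packets stay in age order.
        return item > new
    # Otherwise `item` is larger only if its hash address lies in (h, cell].
    return (hi - h) % N_CELLS <= (cell - h) % N_CELLS


def insert(tl: List[Optional[int]], new: int) -> int:
    """Insert CCD address `new` (simplification: no removal region).

    Returns the cell the new item lands in."""
    if None not in tl:
        raise OverflowError("transaction list is full")
    j = hash_addr(new)
    while tl[j] is not None and not is_larger(tl[j], j, new):
        j = (j + 1) % N_CELLS
    pos, carry = j, new
    while carry is not None:          # push the contiguous run one cell clockwise
        tl[j], carry = carry, tl[j]
        j = (j + 1) % N_CELLS
    return pos
```

Starting from cells 1 to 5 holding 11, 16, 25, 32 and 55 (with cell 7, whose contents the text does not give, left empty), `insert(tl, 0o43)` returns 5 and moves the 55 into cell 6, as required above.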

While incoming command packets are being placed in the TL by the above
procedure, packets are being removed and sent to the CCD memory. This is accomplished
through the use of a transaction list pointer (TLP) which rotates clockwise roughly in
synchronization with the CCD address pointer. When the CCD pointer points to CCD cell
10, the TLP points to TL address 1. Since a packet for address 11 is found there, it waits until
the CCD pointer = 11, removes the packet from the TL, and performs the indicated operation
on the contents of CCD address 11. The TLP is then immediately advanced to the next
position, 2. Since the packet there specifies address 16, it waits until the CCD pointer = 16
and then removes the packet and performs the memory operation. The TLP then moves to 3
and the process continues.
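The synchronized scan can be simulated for one rotation. The following Python sketch is an illustration with hypothetical names. The TL cells hold only CCD addresses. The TLP re-synchronizes with the CCD pointer (TLP = <CCD pointer>) whenever it rests on an empty cell. Consecutive cells for the same address are all served on the same pass. The removal-pointer bookkeeping is omitted.

```python
from typing import List, Optional


def removal_order(tl: List[Optional[int]], mem_size: int, start: int) -> List[int]:
    """CCD addresses served during one rotation starting at CCD address `start`.

    `tl` holds CCD addresses (None = empty cell) and is emptied in place."""
    n = len(tl)
    ratio = mem_size // n
    tlp = start // ratio                       # TLP starts at the same angle
    served: List[int] = []
    for step in range(mem_size):
        ccd = (start + step) % mem_size        # word now under the head
        if tl[tlp] is None:
            tlp = ccd // ratio                 # nothing pending here: keep pace
        while tl[tlp] is not None and tl[tlp] == ccd:
            served.append(tl[tlp])             # perform the operation on this word
            tl[tlp] = None                     # remove the packet from the TL
            tlp = (tlp + 1) % n                # TLP advances immediately
    return served
```

For the example list with the CCD pointer starting at 10, `removal_order` returns 11, 16, 25, 32 and 55 in that order.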

The removal of items from TL makes it necessary to modify the rules for a
well-formed transaction list. If 16 is removed from the example list, the line from cell 2 to
item 25 passes through an empty cell, which would violate the condition given previously.
Therefore, the region from which packets are removed is declared to be the "removal region",
and it is permissible for the line from an item's hash address to the item itself to pass through
the removal region. The removal region is delimited at its counterclockwise end by the
"removal pointer" RP, and at its clockwise end by TLP. After removing 11 and 16, the example
looks like this:

[Figure: the example transaction list after removing 11 and 16; removal region = [RP, TLP)]


Whenever an item is removed, RP is set to the hash address of that item. In
the example, after 25 is removed, RP will be set to 2 (25's hash address), and TLP will be
advanced to 4.

The rules for a well-formed transaction list can now be given formally:

(1) $\forall j, k \in$ TL address space, if $j \neq k$ and $TL(j) \neq \text{empty} \neq TL(k)$, then
$[\langle a(TL(j)) \rangle, j] \not\subset [\langle a(TL(k)) \rangle, k]$

(That is, the interval from the hash address of an item to the item itself is never
contained within the corresponding interval for another item, i.e. the lines never cross.)

(2) $\forall j \in [RP, TLP)$, $TL(j) = \text{empty}$

(That is, cells in the removal region are considered to be empty.)

(3) $\forall j, k \in$ TL address space, if $TL(j) = \text{empty} \neq TL(k)$ and $j \notin [RP, TLP)$, then
$j \notin [\langle a(TL(k)) \rangle, k]$

(That is, the interval from the hash address of an item to the item itself does
not contain any empty cells not in the removal region.)

(4) $\forall j, k \in$ TL address space, if $\langle a(TL(j)) \rangle = \langle a(TL(k)) \rangle$ and $j \in [\langle a(TL(k)) \rangle, k]$,
then $a(TL(k)) \geq a(TL(j))$

(That is, if two items have the same hash address, the more clockwise one has the higher
CCD address, i.e. all the packets having one hash address are ordered by CCD address.)

(5) $\forall j, k \in$ TL address space, if $j \in [\langle a(TL(j)) \rangle, k)$ and $a(TL(j)) = a(TL(k))$,
then $\forall m \in [j, k]$, $a(TL(m)) = a(TL(j))$

(That is, all items with one CCD address are adjacent. This is necessary to be sure that,
when a sequence of adjacent FET(±) packets and a CLR are found, it is possible to
return the LOAD(±) packets followed by a DONE, with no danger that there are unseen
packets elsewhere referring to the same CCD address.)

(6) $\forall j, k \in$ TL address space, if $j \in [\langle a(TL(j)) \rangle, k)$ and $a(TL(j)) = a(TL(k))$,
then $TL(j)$ was placed in the table before $TL(k)$

(That is, the items with the same CCD address are ordered by age, the youngest being
most clockwise.) This property makes it possible to return a DONE packet as soon as
a CLR is encountered in the removal scan, since the packets are encountered in the
same order as they were originally received.
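Rules (2) through (5) can be checked mechanically. The sketch below is a Python test harness of our own, not part of the thesis. TL cells hold CCD addresses or `None`, and `ratio` is assumed to be memory size / TL size as in the definition of <X>. It omits rule (1) and rule (6), which would need arrival times.

```python
from typing import List, Optional


def in_interval(z: int, x: int, y: int, n: int, right_closed: bool) -> bool:
    """z in the clockwise interval [x, y] (or [x, y)) on a ring of n cells."""
    offset, span = (z - x) % n, (y - x) % n
    return offset <= span if right_closed else offset < span


def cells(x: int, y: int, n: int) -> List[int]:
    """The cells of [x, y], listed clockwise from x."""
    return [(x + i) % n for i in range((y - x) % n + 1)]


def well_formed(tl: List[Optional[int]], rp: int, tlp: int, ratio: int) -> bool:
    n = len(tl)

    def in_region(j: int) -> bool:        # j in [RP, TLP)
        return in_interval(j, rp, tlp, n, right_closed=False)

    def hash_of(k: int) -> int:           # <a(TL(k))>
        return tl[k] // ratio

    # Rule (2): cells in the removal region are empty.
    if any(tl[j] is not None for j in range(n) if in_region(j)):
        return False
    for k in range(n):
        if tl[k] is None:
            continue
        span = cells(hash_of(k), k, n)    # [<a(TL(k))>, k]
        # Rule (3): empty cells in that interval must lie in the removal region.
        if any(tl[j] is None and not in_region(j) for j in span):
            return False
        for j in span[:-1]:               # j in [<a(TL(k))>, k)
            if tl[j] is None:
                continue
            # Rule (4): equal hash addresses are ordered by CCD address.
            if hash_of(j) == hash_of(k) and tl[k] < tl[j]:
                return False
            # Rule (5): packets for one CCD address are adjacent.
            if tl[j] == tl[k] and any(tl[m] != tl[k] for m in cells(j, k, n)):
                return False
    return True
```

In the example, removing 11 and 16 leaves 25 in cell 3 with cell 2 empty. `well_formed` accepts that list only when RP = 1 and TLP = 3, because those values place cell 2 inside the removal region.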

The insertion algorithm requires some care when passing through the removal
region. If the scan starts outside of the region and then enters the region, the item is placed
in the first cell, and the region is shortened by one so that that cell is no longer part of the
region. If the scan begins in the region but not in its first cell, the scan skips over the region
and starts after its end. If the scan begins in the first cell of the region, it skips to the end if
the new item's CCD address is greater than or equal to that of the item just past the end.
Otherwise, it is inserted in the first cell and the region is shortened.


[Figure: example transaction list with its removal region]

To insert          Do this
22-27              put at 3, set RP := 4
30-33              put at 3, set RP := 4
34-35              put at 6, push the 36 and 43
36-42              put at 7, push the 43
43-77, 00-07       put at 0
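The region rules can be combined with the ordinary clockwise scan. The Python sketch below is an illustration under stated assumptions, not the algorithm of appendix III A. Cells hold only CCD addresses, and <X> is X // 8 on an 8-cell TL. `insert`, `is_larger` and `in_region` are hypothetical names. The function returns the cell used together with the possibly updated RP.

```python
from typing import List, Optional, Tuple

N_CELLS, RATIO = 8, 8   # running example: 8 TL cells, two-octal-digit CCD addresses


def hash_addr(x: int) -> int:
    return x // RATIO                                   # <X>


def is_larger(item: int, cell: int, new: int) -> bool:
    """Does `item`, stored at `cell`, belong clockwise of `new`?"""
    h, hi = hash_addr(new), hash_addr(item)
    if hi == h:
        return item > new          # equal addresses stay in arrival order
    return (hi - h) % N_CELLS <= (cell - h) % N_CELLS


def in_region(j: int, rp: int, tlp: int) -> bool:
    return (j - rp) % N_CELLS < (tlp - rp) % N_CELLS    # j in [RP, TLP)


def insert(tl: List[Optional[int]], new: int, rp: int, tlp: int) -> Tuple[int, int]:
    """Insert CCD address `new`; returns (cell used, new RP). TLP is unchanged."""
    if None not in tl:
        raise OverflowError("transaction list is full")
    j = hash_addr(new)
    if in_region(j, rp, tlp):
        past_end = tl[tlp]                   # the item just past the region
        if j == rp and (past_end is None or is_larger(past_end, tlp, new)):
            tl[j] = new                      # take the first cell of the region
            return j, (rp + 1) % N_CELLS     # and shorten the region by one
        j = tlp                              # otherwise skip over the region
    while True:
        if in_region(j, rp, tlp):            # scan entered the region at RP
            tl[j] = new
            return j, (rp + 1) % N_CELLS
        if tl[j] is None or is_larger(tl[j], j, new):
            break
        j = (j + 1) % N_CELLS
    pos, carry = j, new
    while carry is not None:                 # push the contiguous run clockwise
        tl[j], carry = carry, tl[j]
        if carry is None and in_region(j, rp, tlp):
            rp = (rp + 1) % N_CELLS          # the push wrapped into the region
        j = (j + 1) % N_CELLS
    return pos, rp
```

Because the figure is lost, the following layout is only a reconstruction that reproduces every row of the table. It uses a removal region [3, 5), with 34, 36 and 43 in cells 5 to 7, cell 0 empty, and cells 1 and 2 holding the purely hypothetical items 12 and 21. With that layout, inserting 35 returns cell 6 and pushes the 36 and 43.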

The algorithm for inserting an item into the TL is given in appendix III A. If the
TL already contains an UPD packet for the same address, it instead performs the indicated
action, perhaps modifying the UPD packet and perhaps transmitting a packet at RESO.

The removal algorithm is somewhat simpler. The TL item pointed to by TLP is
next to be removed. The CCD pointer indicates the current item available at the CCD output.
From the standpoint of the algorithms for handling the TL, the CCD pointer must be considered
to be inexorably advancing under control of an external agency. The external agency is the
clock controlling the shifting of the CCD shift register, or, in the case of a magnetic disk
memory, it is the information being read from the disk's timing tracks.

The fact that the CCD pointer is synchronized to external events means that it
cannot be integrated fully into a system using the packet communication principle. It must be
considered external to the packet system, and some synchronizers or arbitration devices must
be used in the interface. The design of such an interface is a common problem of digital
system design, and is beyond the scope of this thesis. We will assume that the interface
between the synchronous memory device and the packet system consists of ports CCDI and
CCDO. Every time the CCD advances to a new address, an ADDR packet containing that cell's
address and data is sent to the system through port CCDI. If the system fails to
acknowledge the ADDR packets fast enough, so that the CCD is prevented from sending one, it
may either drop the packet or wait until the CCD has shifted all the way around to the same
address again. After the system receives an ADDR packet at CCDI announcing that an address
has been reached, it may transmit a WRITE packet at CCDO, giving the address and new data
to write. If this packet is not transmitted soon enough, it might be too late to write the data
into the CCD. In this case, the CCD shifts all the way around, not emitting any ADDR packets,
until the address is reached again, and then writes the data.

Wasting an entire rotation time whenever the asynchronous part of the system
can't keep up with the CCD clock may seem drastic, but it doesn't happen very often.
Whenever an asynchronous system must communicate with something such as the CCD clock,
there is the possibility that it may be late. However, it is not difficult to design the system
such that the probability of this happening is vanishingly small. If this is done, it is possible
to prescribe drastic remedies when it does occur, without significantly degrading system
performance.

The above description of the interface to the CCD may be somewhat
simple-minded. Many memory devices require that the write command, and the data to be written, be
given before the previous data from the address is available. This means that the
protocol whereby the system issues a WRITE packet only after receiving an ADDR packet
bearing the data might not be appropriate. In the case of a CCD or other shift register, the
problem can be solved by having two "taps" on the register: one for reading, and another,
one or two bits later, for writing. In the case of a disk memory, the problem is more serious,
and may require that the disk announce each address slightly before the data becomes
available. The necessary modifications to the asynchronous part of the system will not be
treated here.


The rotating memory module then looks like this:

[Figure: the rotating memory module, with ports CMDI and RESO]


The removal algorithm waits for an ADDR packet at CCDI matching the address
contained in the packet in the transaction list pointed to by TLP. When found, it performs the
indicated transaction, perhaps sending a packet out at RESO. It then sets RP to the hash
address of the item which was just processed, which may shorten the removal region. The
item is then erased from the transaction list, and TLP is advanced to the next position. If TLP
now points to an item having the same CCD address, that item is processed also, using the
same data. All transactions giving the same address are handled in this way. Any reference
count changes are noted, and the modified reference count is written back into memory with a
WRITE packet at CCDO.

When TLP reaches a cell which does not contain a transaction for the same
address, either the cell holds a transaction for a different address or it is empty. In the former case, the system
waits for the CCD to reach the new address. In the latter case, it sets RP = TLP, destroying
the removal region, and then advances both RP and TLP, in step with the ADDR packets that
give the CCD address, until it finds a transaction to perform.
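One step of the removal algorithm, the handling of a single ADDR packet, might look like the following Python sketch. The names (`Packet`, `RotatingMemory`, `on_addr`) and the tuple layout of the output packets are hypothetical. The sketch writes back whenever a FET+, FET- or UPD was processed, even if the net change is zero. It ignores the timing problem of a WRITE that arrives too late.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DELTA = {"FET": 0, "FET+": 1, "FET-": -1}


@dataclass
class Packet:
    kind: str            # "FET", "FET+", "FET-", "CLR" or "UPD"
    addr: int            # CCD address a(P)
    tag: object = None   # FET only
    data: object = None  # UPD only
    ref: int = 0         # UPD only


class RotatingMemory:
    """Transaction list plus RP and TLP; the CCD itself is outside the sketch."""

    def __init__(self, n_cells: int, ratio: int) -> None:
        self.tl: List[Optional[Packet]] = [None] * n_cells
        self.n, self.ratio = n_cells, ratio
        self.rp = self.tlp = 0

    def on_addr(self, addr: int, data: object, ref: int) -> Tuple[List[tuple], Optional[tuple]]:
        """Handle ADDR(addr, data, ref) from CCDI.

        Returns the packets for RESO and the WRITE packet for CCDO (or None)."""
        if self.tl[self.tlp] is None:
            # Empty cell: RP := TLP (the removal region disappears) and both
            # pointers keep pace with the CCD address.
            self.rp = self.tlp = addr // self.ratio
        out: List[tuple] = []
        dirty = False                      # sketch choice: any FET+/FET-/UPD writes back
        p = self.tl[self.tlp]
        while p is not None and p.addr == addr:   # every packet for this address
            if p.kind in DELTA:
                ref += DELTA[p.kind]
                dirty = dirty or p.kind != "FET"
                out.append(("LOAD" + p.kind[3:], addr, data, ref, p.tag))
            elif p.kind == "CLR":
                out.append(("DONE", addr))
            else:                                  # UPD: new data and count
                data, ref, dirty = p.data, p.ref, True
            self.rp = addr // self.ratio           # RP := hash address of the item
            self.tl[self.tlp] = None               # erase the item from the TL
            self.tlp = (self.tlp + 1) % self.n
            p = self.tl[self.tlp]
        write = ("WRITE", addr, data, ref) if dirty else None
        return out, write
```

When the ADDR packet for address 11 arrives with a FET+, a FET- and a CLR queued for that address, one call returns LOAD+, LOAD- and DONE packets plus a single WRITE of the final reference count, and leaves RP at 1.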

The  algorithm  for  the  rotating  memory  is  given  in  appendix  III  B. 


5.0 STRUCTURE CONTROLLER DESIGN CONSIDERATIONS

In  this  section  we  will  examine  a few  of  the  considerations  that  must  go  into 
the  design  of  an  efficient  structure  controller. 

5.0.1 CHECKING THAT THE CONTROLLER OBEYS MMUSER

The structure controller never issues an UPD command unless the reference
count is known to be one. Since this is so, there can be no transactions pending on that cell,
so the requirements of MMUSER are satisfied. This is contingent, of course, on the rest of the
computer correctly realizing CONTROLLERUSER. A reference count violation by the computer
could lead to an UPD packet being sent while there are transactions pending.

5.0.2 PRECISE REFERENCE ACCOUNTING WITH IMPRECISE REFERENCE COUNTS

In checking that MM satisfies the needs of the structure controller, there is a
point of possible danger that needs to be checked. Since LOAD(±) packets may be returned
from the memory in an order different from that of the FET(±) packets, it was shown in
section 3.0.2 that the reference counts returned from the memory may be unusual, perhaps
even negative. Is it possible for this to interfere with the cell management mechanism? The
answer is no, as long as the following rule is obeyed:

After increasing a reference count (with a FET+), do not pass the result to any
destination until the corresponding LOAD+ has returned.

For example, if an instruction cell indicates two destinations for its result, the
reference count of the result must be increased with a FET+ before the result is sent to the
destination cells. If one of those cells is a SELECT that issues a FET- to reduce the reference
count, the FET+ must act first. Furthermore, it is not enough to rely on the zero latency
arbitrator to be sure the FET+ gets to the memory before the FET-. The FET- must not be
sent until the LOAD+ arising from the FET+ has returned. This is accomplished by not sending
the result to the destination cells until the LOAD+ has been received.
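The rule can be pictured as a small piece of bookkeeping in whatever unit fans a result out to several destinations. The following is a minimal Python sketch under stated assumptions, not the thesis's design: the class name `ResultFanout`, the tuple packet format `("FET+", pointer, tag)`, and the callbacks `send_to_memory` and `deliver` are all hypothetical, and each FET+ is assumed to raise the count by exactly one.

```python
class ResultFanout:
    """Hypothetical holder for pointer results awaiting their LOAD+ packets."""

    def __init__(self, send_to_memory, deliver):
        self.send_to_memory = send_to_memory  # callable(packet): issue a command at CMDI
        self.deliver = deliver                # callable(destination, pointer)
        self.held = {}                        # tag -> [pointer, destinations, LOAD+ still awaited]
        self.next_tag = 0

    def fan_out(self, pointer, destinations):
        if not destinations:
            raise ValueError("a result needs at least one destination")
        tag = self.next_tag
        self.next_tag += 1
        extra = len(destinations) - 1          # the incoming copy is already counted
        self.held[tag] = [pointer, list(destinations), extra]
        for _ in range(extra):
            self.send_to_memory(("FET+", pointer, tag))
        if extra == 0:
            self._release(tag)                 # nothing to wait for

    def on_load_plus(self, tag):
        # Called when a LOAD+ carrying this tag comes back from the memory.
        entry = self.held[tag]
        entry[2] -= 1
        if entry[2] == 0:
            self._release(tag)

    def _release(self, tag):
        # Only now may any destination (e.g. a SELECT that will issue FET-) see the pointer.
        pointer, destinations, _ = self.held.pop(tag)
        for destination in destinations:
            self.deliver(destination, pointer)
```

A pointer sent to two destinations thus causes one FET+ and is delivered only after `on_load_plus` reports the matching LOAD+; a single destination needs no FET+ and is delivered at once.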
It is easy to see that no cell will fail to be reclaimed that should be reclaimed.
At the time the last "owner" of a cell issues a FET- to discard it, there are no other
operations pending on the cell, so the LOAD- packet that is returned will have the correct
reference count, which is zero.

To see that no cell will be accidentally reclaimed that shouldn't be, consider a
cell with reference count 2, owned by instruction cells X and Y. Suppose X performs a
structure operation that discards its copy, so that a FET- is issued. We must show that if Y
does not discard its copy, the LOAD- that arises from X's operation will not have reference
count zero. The only way the reference count could possibly go to zero is if Y also causes a
FET-. Since Y does not intend to discard its copy of the cell, a FET+ must have been issued
first. (That is, the reference count should actually go up to 3, then down to 2 and then 1.)

The memory receives the following sequence at CMDI:

FET-(addr,X) ; FET+(addr,Y) ; FET-(addr,Y)

The situation to be avoided is that in which the second and third LOAD packets are reversed:

LOAD-(addr,~,1,X) ; LOAD-(addr,~,0,Y) ; LOAD+(addr,~,1,Y)

This can't happen, because the FET-(addr,Y) is not sent until the LOAD+(addr,~,~,Y) has been
returned.
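The danger can be checked by replaying the two orders directly. This hypothetical sketch models a single cell whose FET commands are applied one at a time in the order given, and records the (sign, count, owner) of each LOAD packet; the packet encoding and function names are assumptions made for illustration.

```python
def load_packets(initial_count, commands):
    """Apply FET+/FET- commands (sign, owner) to one cell in the order given
    and return the LOAD packets (sign, count, owner) the memory sends back."""
    count = initial_count
    loads = []
    for sign, owner in commands:
        count += 1 if sign == "+" else -1
        loads.append((sign, count, owner))
    return loads


def reclaimers(loads):
    """Owners whose LOAD- shows a zero count, i.e. who would reclaim the cell."""
    return [owner for sign, count, owner in loads if sign == "-" and count == 0]


# Order enforced by the rule: Y's FET- waits for the LOAD+ from its FET+.
safe = load_packets(2, [("-", "X"), ("+", "Y"), ("-", "Y")])
# Order the rule forbids: Y's FET- reaches the memory before its FET+.
unsafe = load_packets(2, [("-", "X"), ("-", "Y"), ("+", "Y")])
```

In the safe order no LOAD- reports zero; in the forbidden order Y's LOAD- reports zero and the cell would be reclaimed while X and Y still hold it.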

5.0.3  MEMORY  LATENCY 

MM's latency was left unspecified only for the purpose of proving correctness
of MM and its user. When actually implementing a practical memory, it may be
necessary to build a high degree of latency into some modules in order to obtain good
performance. For example, a "rotating" implementation of MM using a charge coupled shift
register may be designed to have hundreds or thousands of commands pending at one time,
although its correctness does not depend on this.

5.0.4 THROUGHPUT AND DISTRIBUTED PROCESSING

One of the fundamental principles of data flow computers is that, if enough
parallelism exists in the program, a computer should be able to run arbitrarily fast for a given logic
speed. To do this, it must distribute the computation and be free of bottlenecks. If a data
flow computer could only have one multiply unit, that would be a bottleneck, since it would
limit the rate at which multiplies could be performed. The data flow concept must not place
any restrictions at all on the number of multipliers that a computer can have (although any
given computer of course has a fixed number). There must not even be bottlenecks in ports
through which packets must pass. If every multiply operation packet had to pass through one
input port of an allocator on its way to the multipliers, that would be unacceptable, since the
logic speed places a limit on the rate at which packets can pass through a port. For example,
if a port could handle packets 100 times faster than a multiplier could process them and all
packets had to pass through one port, it would mean that no more than 100 multipliers could
be usefully employed.
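The port arithmetic in the example above is a single division. The function name and its parameters are hypothetical; rates are taken as integers in packets per unit time, and the result ignores every other limit on the system.

```python
def max_busy_units(port_rate, unit_rate, ports=1):
    """Most functional units that can be kept busy when every operation
    packet must pass through one of `ports` ports.

    port_rate: packets per unit time that one port can pass
    unit_rate: packets per unit time that one functional unit can process
    """
    return (ports * port_rate) // unit_rate
```

With `port_rate=100` and `unit_rate=1` the bound is 100 multipliers, as in the text; spreading the traffic over four ports raises it to 400, which is why the ports themselves must be replicated.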

In the case of simple functional units such as multipliers, it is not difficult to
avoid bottlenecks: multiple functional units may be used, and the arbitration and distribution
networks that connect them to the instruction cells may be designed to be free of bottlenecks
and thus maintain any desired throughput rate [5]. For the same reason, multiple structure
controllers are used, each with its own ports connected to the arbitration and distribution
networks of the data flow computer. Also, multiple memory units are used, because the total
memory transaction rate is greater than can pass through a single pair of CMDI/RESO ports.

It is not possible to compartmentalize the structure operation facilities as can
be done with simple functional units. Connecting each structure controller to one memory
module is not correct, because each structure controller must have access to the entire
memory address space. The structure controllers must be connected to the memories through
an interconnection network consisting of arbitrators and distributors for packets going in each
direction. Command packets from the structure controllers have part of the address field
removed and used to select the output port of the distributor, just as was done for the
multiple memory connection in section 3.1. In this way, each structure controller "sees" the
full address space, while each memory module supports only a small part of the total address
space. The command packets from the different structure controllers are merged in
arbitrators, which append the incoming port number to the tag field, so that the result packet
will be returned to the correct controller. Packets coming out of the RESO ports of the
memory modules pass through distributors that use the added bits of the tag field, and
arbitrators that use the incoming port number to reconstruct the full address.

[Figure: interconnection network. A2 inserts the input port number into the address; D2 removes part of the tag and uses it to select the output port.]

The treatment of address fields and tag fields is symmetrical. One could think
of all pending structure operations as occupying a "tag space". Just as each memory module
supports a small part of the total address space, each structure controller supports a small
part of the total tag space. The job of the interconnection network is to make the entire
address space available to each structure controller, and to make the entire tag space
available to each memory unit.
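The address and tag manipulation performed by the four kinds of network elements can be sketched as bit operations. This is a minimal illustration, not the thesis's encoding: it assumes four memory modules selected by the two low-order address bits and four structure controllers identified by two tag bits placed at the low end of the tag. The labels A1, D1, A2, D2 follow the diagram; the function names and bit widths are hypothetical.

```python
MODULE_BITS = 2   # assumption: 4 memory modules, chosen by low-order address bits
CTRL_BITS = 2     # assumption: 4 structure controllers


def distribute_command(addr):
    """D1: strip the module-select bits from the address and use them to pick a module."""
    module = addr & ((1 << MODULE_BITS) - 1)
    return module, addr >> MODULE_BITS


def arbitrate_command(local_addr, tag, controller_port):
    """A1: merge commands, appending the incoming port number to the tag."""
    return local_addr, (tag << CTRL_BITS) | controller_port


def distribute_result(tag):
    """D2: strip the added tag bits and use them to pick the controller."""
    controller = tag & ((1 << CTRL_BITS) - 1)
    return controller, tag >> CTRL_BITS


def arbitrate_result(local_addr, module_port):
    """A2: merge results, inserting the incoming port (module) number into the address."""
    return (local_addr << MODULE_BITS) | module_port
```

A command for address 13 from controller 2 thus goes to module 1 as local address 3, and the returning result is routed back to controller 2 with its original tag and the full address 13 restored.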

It is not necessary for the network to place all the distributors before the
arbitrators. Such a network would have a size proportional to the product of the number of
structure controllers and the number of memory units, which may be excessive. It is possible
to mix arbitrators and distributors in a network in such a way that the size is reasonable but
bottlenecks are avoided.

Because UPD packets do not have a tag field and do not give rise to result
packets at RESO, it is necessary that the arbitrators and distributors carrying packets from
the structure controllers to the memory modules (those labelled A1 and D1 in the preceding
diagram) have latency zero. This is so that, when a structure controller receives an
acknowledge for an UPD packet, it will be guaranteed that the packet has passed through the
arbitrator and is therefore ahead of any packet that may subsequently be introduced into
another input of the arbitrator. Suppose this were not done: One structure controller might
write on a cell, thereby completing the creation of a structure. When it receives an
acknowledge for that UPD command, it assumes that the structure is complete, and so it
returns it to the rest of the computer. An instruction cell in the computer, having received
this structure, may fire, causing a SELECT operation to be generated. The allocator may send
the SELECT operation packet to another structure controller, which then sends out a FET
packet with the same address. If there is buffering before the arbitrator that merges packets
from the two structure controllers, the original UPD packet might still be in such a buffer, so
the FET packet passes through the arbitrator first. If this happens, the old data will be read,
rather than the new data supplied by the UPD packet. By making sure that the distributor
and arbitrator have latency zero, the UPD packet cannot get stuck in a buffer. When the first
structure controller receives an acknowledge for the UPD packet, that packet is known to
have been accepted by the arbitrator, and hence it will precede any subsequent FET packet.
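The ordering hazard just described can be replayed in a few lines. This is a toy model, assuming one particular arbitrator schedule (the idle input from controller B is served before A's buffer drains); real arbitrators may behave differently, and all names are hypothetical.

```python
from collections import deque


def read_after_write(buffered_arbitrator):
    """Return the value that controller B's FET sees after controller A's
    UPD to the same address has been acknowledged."""
    memory = {"a": "old"}
    to_memory = []        # packets in the order they leave the arbitrator
    a_input = deque()     # buffer on the arbitrator input from controller A

    upd = ("UPD", "a", "new")
    if buffered_arbitrator:
        a_input.append(upd)     # acknowledged as soon as the buffer takes it
    else:
        to_memory.append(upd)   # latency zero: acknowledged only once it is through

    # A now has its acknowledge and releases the structure; B soon issues a FET
    # for the same address. B's input is idle, so this schedule passes it first.
    to_memory.append(("FET", "a"))
    to_memory.extend(a_input)   # the buffered UPD drains afterwards

    seen = None
    for packet in to_memory:
        if packet[0] == "UPD":
            memory[packet[1]] = packet[2]
        else:
            seen = memory[packet[1]]
    return seen
```

With a buffered input the FET reads the stale value; with a zero-latency arbitrator the acknowledge itself guarantees the UPD is ahead, so the FET reads the new data.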

If it is not feasible for the interconnection network to use distributors and
arbitrators that have no memory, it is necessary to put tag fields in all UPD packets
passing through the network. An "adapter unit" is placed between the network and each
memory module. The adapter passes all packets through except UPD packets. When it
receives UPD(addr, data, ref, tag), it sends UPD(addr, data, ref) to the memory and UACK(tag)
back to the interconnection network. The structure controller does not return a structure to
the rest of the computer until it has received UACK replies for all UPD commands that it has
sent. Whether such UACK packets are required is a question of the design of efficient routing
networks and is beyond the scope of this thesis.
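The adapter unit and the controller's UACK bookkeeping can be sketched as follows, assuming packets are Python tuples whose first element names the command. The function `adapter` and class `UpdTracker` are hypothetical names, and exactly one UACK is assumed per tagged UPD.

```python
def adapter(packet):
    """Split a packet arriving from the network into
    (packet for the memory module, reply for the network or None)."""
    if packet[0] == "UPD":
        _, addr, data, ref, tag = packet
        return ("UPD", addr, data, ref), ("UACK", tag)
    return packet, None          # all other packets pass straight through


class UpdTracker:
    """Controller side: a structure may be returned to the rest of the
    computer only when every UPD sent so far has been answered by a UACK."""

    def __init__(self):
        self.outstanding = set()

    def sent_upd(self, tag):
        self.outstanding.add(tag)

    def got_uack(self, tag):
        self.outstanding.remove(tag)

    def may_return_structure(self):
        return not self.outstanding
```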

5.0.5  THE  FREE  STORAGE  LISTS 

To maintain just one free storage list would create a bottleneck, so each
structure controller has one. Whenever a structure controller needs a word in order to
create a node, it takes its address from the packet presented at input port UIDI. (UID stands
for unique identifier.) The structure controller does not ask for addresses at UIDI; they are
supplied in an "unending" stream, as fast as they are acknowledged.

The sources of the streams at UIDI are also the structure controllers, each of
which maintains a free storage list and sends out addresses through output port UIDO. The
UIDO ports are connected to the UIDI ports through a collection of allocators and arbitrators
called the UID network. The purpose of this network is to maintain a supply of free cells to
all controllers, even if some controllers' free storage lists should run out.
Each structure controller, in addition to performing structure operations,
maintains a free storage list. Whenever an acknowledge is received on UIDO, it takes a cell
from the list and transmits it in a UID packet through UIDO. Since a reference count scheme is
used for recovering unused cells, the controller watches for words whose reference counts go
to zero. Every time it reduces a reference count by issuing a FET- command, it examines the
LOAD- packet that is returned. If it shows a reference count of zero, the word is reclaimed.
This involves placing the word in the free storage list and, since whatever pointers it
contained are destroyed, reducing their reference counts if their elem bits are off. If either
or both of the latter reference counts go to zero, those words are reclaimed by the same
process.


The procedure is recursive, and is an unpleasant type of recursion because the
completion of each operation can produce two more operations to perform. Although the
recursion always terminates, a huge amount of storage may be required to hold the list of
words that need to have their reference counts reduced. The problem at its worst can be
observed in the case of a large tree, no subtree of which is shared with anything else, whose
root node is discarded. All nodes have an initial reference count of 1, so, when each node has
its count reduced, it goes to zero, making it necessary to reduce the counts of both of that
node's offspring.
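The growth of the pending list can be estimated for the worst case just described. This sketch assumes an unshared complete binary tree whose leaves hold only elements, and assumes the pending reductions are served first in, first out, as might happen when many are in flight at once; it returns the peak length of the list under the naive scheme of queueing both offspring. The function name is hypothetical.

```python
from collections import deque


def peak_pending_naive(depth):
    """Peak length of the pending-reduction list when every zero count
    immediately queues BOTH offspring (served first in, first out), for an
    unshared complete binary tree with leaves at the given depth."""
    pending = deque([0])          # tree level of each node awaiting a FET-
    peak = 1
    while pending:
        level = pending.popleft()
        if level < depth:         # interior node: both halves are pointers
            pending.extend([level + 1, level + 1])
        peak = max(peak, len(pending))
    return peak
```

For a tree of depth d the peak is 2^d pending reductions, one per leaf, so a tree of depth 20 would need room for over a million entries. In the scheme the thesis adopts, each completed reduction creates at most one new one, so a single discarded pointer never has more than one reduction outstanding.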


To implement this procedure by simply issuing two FET- packets whenever a
word's reference count goes to zero (that is, whenever a LOAD- is received bearing a count
of zero) would create an intractable deadlock problem because of the proliferation of packets.
Instead, only the right offspring of a word should
be treated at the time the word is placed on the free storage list. The pointer to the left
offspring will remain in the word while it is on the free storage list. The recursion in this
procedure is under control, since only one new operation is created for every operation that
is completed. When a word is taken from the free storage list, the reference count of its left
offspring is reduced, which may cause one or more words to be reclaimed, before the word is
used.


The memory management algorithm is as follows:

(1) Whenever a word's reference count is reduced, examine the LOAD- packet
that is returned. If it shows a count of zero, put the word on the free
storage list and, if the elem bit in its right half is zero, reduce the reference
count of the word pointed to by that half; this may cause this step to be
repeated.

(2) Whenever an acknowledge is received from port UIDO, get a word from the
free storage list and send the packet UID(addr, its left half) through UIDO.
(The contents of the left half are sent simply to avoid an extra memory
reference.)

(3) Whenever a fresh cell is needed for creation of a structure node, take the
packet UID(addr, obj) at port UIDI and acknowledge same. Addr is the
address of the new cell. If the elem bit of obj is off, reduce the reference
count of the word that obj points to. This may cause step (1) to be invoked.
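Steps (1) to (3) can be exercised in software. The following is a minimal single-threaded Python sketch, assuming a word is a pair of halves, each an (is_elem, value) pair, and that a FET- with its returned LOAD- is modeled by decrementing a dictionary entry. The class `CellManager` and its method names are hypothetical, and the concurrency and packet traffic of the real controller are omitted.

```python
from collections import deque


class CellManager:
    """Toy, single-threaded model of steps (1)-(3). Names are hypothetical."""

    def __init__(self):
        self.refcount = {}   # addr -> reference count
        self.left = {}       # addr -> (is_elem, value)
        self.right = {}      # addr -> (is_elem, value)
        self.free = deque()  # free storage list

    def store(self, addr, left, right, refcount=1):
        self.left[addr], self.right[addr] = left, right
        self.refcount[addr] = refcount

    def reduce(self, addr):
        """Step (1): a FET- whose LOAD- is examined; only the right half is followed."""
        while addr is not None:
            self.refcount[addr] -= 1
            if self.refcount[addr] != 0:
                return
            self.free.append(addr)              # word reclaimed
            is_elem, value = self.right[addr]
            addr = None if is_elem else value   # at most one new reduction

    def send_uid(self):
        """Step (2): on an acknowledge at UIDO, send UID(addr, left half)."""
        addr = self.free.popleft()
        return addr, self.left[addr]

    def fresh_cell(self, uid_packet):
        """Step (3): take UID(addr, obj); release obj first if it is a pointer."""
        addr, (is_elem, value) = uid_packet
        if not is_elem:
            self.reduce(value)
        return addr
```

Discarding the root of a seven-node tree puts only the root and its rightmost chain on the free list; the left subtrees are released one at a time as cells are taken back out, so the whole tree is recovered without ever needing more than one pending reduction.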
5.0.6 MAINTAINING INTEGRITY OF THE REFERENCE ACCOUNTING MECHANISM

The possibility of an error in the reference accounting and cell management
mechanism is a troublesome problem, because, as explained in section 2.1.1, it is impossible
for the memory to detect a reference accounting error by its user. Furthermore, the effects
of such an error are unpredictable, and may show up in completely unrelated parts of the
computation. However, there are a few things that can be done to minimize the probability of
such an error being undetected.

First, all cells on the free storage list can be marked in some way, perhaps by a
bit reserved for this purpose. Any reference to a marked cell other than for the purpose of
removing it from the free storage list is a detectable error. Also, the free storage list can be
organized in such a way that cells are added at one end and removed from the other, thereby
maximizing the time that a cell stays on the list once it is put there. If a cell is erroneously
reclaimed while a "spurious" pointer to it exists, it will then probably still be on the free
storage list when the spurious pointer is used, so the error can be detected.
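The two checks on the free storage list can be combined in one structure. This sketch is hypothetical: the mark bit is modeled as set membership, and `check_reference` stands for whatever test the controller would apply to each address it is about to use.

```python
from collections import deque


class CheckedFreeList:
    """Free storage list with a per-cell mark bit (hypothetical model)."""

    def __init__(self):
        self.cells = deque()
        self.marked = set()          # stands in for the reserved mark bit

    def put(self, addr):
        self.cells.append(addr)      # add at one end...
        self.marked.add(addr)

    def take(self):
        addr = self.cells.popleft()  # ...remove from the other, so each cell
        self.marked.discard(addr)    # stays on the list as long as possible
        return addr

    def check_reference(self, addr):
        # Any use of a marked cell, other than taking it off the list, is an error.
        if addr in self.marked:
            raise RuntimeError(f"reference to cell {addr} on the free storage list")
```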

Another way of checking integrity of reference counts is to conduct an "audit"
of the entire computer. This can be done at the end of the computation, or at any point
during the computation. The host computer must disable all instruction cells and wait for all
pending operations to clear out of the structure controllers and the routing networks. All
reference counts can then be checked against the contents of the input registers of the
instruction cells.
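The audit amounts to recounting pointers and comparing. The sketch is an assumption-laden illustration: besides the input registers named in the text, it also counts pointers held in the halves of words in memory, since reclaiming a word reduces the counts of the pointers it contains (section 5.0.5), and for words on the free list it counts only the left half, whose reduction is deferred. It assumes the UID network and all routing networks are empty when the audit runs; all names are hypothetical.

```python
from collections import Counter


def audit(refcounts, register_pointers, words, free):
    """
    refcounts: addr -> stored reference count
    register_pointers: pointers held in instruction-cell input registers
    words: addr -> (left, right), each half an (is_elem, value) pair
    free: set of addresses currently on the free storage list
    Returns the sorted addresses whose stored count disagrees with the
    number of pointers actually found.
    """
    found = Counter(register_pointers)
    for addr, (left, right) in words.items():
        # A free word still holds its left pointer, whose reduction is deferred.
        halves = (left,) if addr in free else (left, right)
        for is_elem, value in halves:
            if not is_elem:
                found[value] += 1
    candidates = set(words) | set(found)
    return sorted(a for a in candidates if refcounts.get(a, 0) != found[a])
```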
6.0 THE DEADLOCK PROBLEM

The structure controller and cache module that were described previously were
both required to have a large capacity for state information which would be unnecessary if
one could always be sure that the device lower in the hierarchy would accept a command.

In the case of the structure controller, the general behavior upon receiving a
result packet from the memory is to perform some transformation on the data in its state
memory and then send a new command packet. Its internal state memory could be dispensed
with, and the state information placed directly into the tag fields of the packets. When a
result packet is received from the memory, a "memoryless" controller's functions would then
be simply to perform a transformation on the packet itself, forming a new packet which is sent
to the memory. The reason this fails is that one can't be sure the memory won't decide to
return several result packets (perhaps all pending ones) before it accepts any more command
packets. Suppose this happened to a memoryless structure controller. It would have no
place to put the result packets if the memory unit isn't accepting any more commands, so a
deadlock would occur. The problem is that the controller has violated the rule that it must
always be prepared to accept the results of all pending operations. A structure controller
having state memory avoids this problem by always having space to absorb the results of all
pending operations.

A similar problem arises in the cache module. If a word is not in the cache and
a FET(±) packet is received, a cell is immediately allocated for it and placed in state P. A
FET(±) packet is also sent to main memory to fetch the data. Until the data returns from the
memory, the cell in the cache does not have data in it, so it serves no useful purpose. It
might seem to make more sense to allocate the cache cell only when the first LOAD(±) packet
is received from the memory rather than when the first FET(±) packet is received from the
user, that is, to bypass state P altogether. The problem is that the creation of a cell in the
cache may require writing out the cell's former contents. If the cell is created in consequence
of the LOAD(±) packet coming from memory, the cache may have to send a packet to memory
in response to a packet from memory. If the memory sends such LOAD(±) packets but does
not accept any replies, the cache would have no place to put the data, so a deadlock would
occur. The cache implementation given in section 3.2 avoids this problem by reserving space
for the LOAD(±) packet in advance. If an UPD packet must be sent to the memory, it is done
in response to input from the user rather than from the memory. This way, if the memory
temporarily refuses to accept the UPD, the cache can simply refuse to accept input from its
user.


In both the structure controller and the cache, the cost incurred as a result of
this problem is an amount of memory sufficient to account for all the packets that can be simultaneously
pending in all lower levels. In the controller, this is the state information for all concurrently
executing structure operations. In the cache, a cell might be in state P for every
FET(±)/LOAD(±) cycle that is pending at that instant. Since a cell in state P is useless, the
cache must be that much larger than it otherwise would be, for a given level of performance.

In the case of the structure controller, the memory space is needed somewhere
in any case. If a great number of memory transactions can be pending simultaneously, a
"rotating" memory, such as was described in section 4.0, is presumably being used. If a
memoryless structure controller is used, the state information for pending operations is stored
in the tag fields instead of the controller. But the tags of pending memory operations must be
stored in the transaction list of the rotating memory, so whatever space was saved in the
controller is used up in the transaction list.

Why, then, would a memoryless structure controller be more desirable? The
reason is that memory space inside the controller is much more expensive than in the
transaction list. The controller must be able to process information as fast as the highest
level of the memory hierarchy. If that highest level is a cache using high speed (and
expensive) devices, the controller must be equally fast. The rotating memory is at the bottom
of the hierarchy, so its transaction list can use a slower and less expensive logic family.

In  order  to  use  a memoryless  structure  controller  or  a cache  which  does  not 
use  "P"  cells,  the  memory  system  below  the  controller  or  the  cache  must  obey  the  following 
"fixed  latency  law": 
Whenever a result packet is transmitted at RESO, the device must accept a
packet at CMDI. If that packet is an UPD, it must accept yet another, until it
has taken one that is not UPD. It must do this even if the user does not accept
anything further at RESO.

The reason UPD packets are a special case is that they do not generate any result, so the
system should be able to absorb them in unlimited numbers.
Some memory systems obey this law. A random access implementation of MM
clearly does. A rotating implementation can also, since the transaction list has fixed size.
Whenever an item is taken out of the TL, another can be inserted. (The implementation of the
rotating memory given in section 4.0 did not always behave this way, but it could easily be
modified to do so.)
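A memory obeying the fixed latency law, in the style of a rotating memory with a fixed-size transaction list, can be modeled as follows. This is a toy sketch, assuming commands are tuples in which UPD carries (addr, data) only and other commands carry (addr, tag); the class name and packet formats are hypothetical.

```python
from collections import deque


class FixedLatencyMemory:
    """Toy memory whose transaction list has a fixed number of slots."""

    def __init__(self, slots):
        self.slots = slots
        self.pending = deque()   # commands that still owe a result at RESO
        self.cells = {}

    def can_accept(self, cmd):
        # UPD produces no result, so it never needs a slot.
        return cmd[0] == "UPD" or len(self.pending) < self.slots

    def accept(self, cmd):
        if not self.can_accept(cmd):
            raise RuntimeError("command refused at CMDI")
        if cmd[0] == "UPD":
            _, addr, data = cmd
            self.cells[addr] = data
        else:
            self.pending.append(cmd)

    def emit_result(self):
        # Sending a result frees exactly one slot, so the next non-UPD command
        # offered at CMDI is guaranteed to be accepted: the fixed latency law.
        _, addr, tag = self.pending.popleft()
        return ("LOAD", addr, self.cells.get(addr), tag)
```

Driven by a controller that forms each new command from the result packet alone, the number of pending transactions stays fixed and no command is ever refused, which is what a memoryless controller requires.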

The systems that do not obey the fixed latency law are the horizontal
composition of MM units and the cache. The former includes the interconnection network
between the structure controllers and the memory units. In the case of the horizontal
interconnection of units each of which obeys the fixed latency law, when one unit transmits a
result packet, it will accept a new command. That result packet passes through the arbitrator
and becomes a result of the interconnection, so the interconnection must accept another
command. If the command is addressed to a different MM unit than the one that transmitted
the result, that unit might not be able to accept it. What is needed is a way for the units to
share the burden of pending transactions with each other.
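The failure of horizontal composition can be seen with a count of pending transactions per unit. This hypothetical helper checks whether the interconnection can take a command addressed to `target_unit` just after `emitting_unit` has produced a result, assuming every unit has the same number of transaction slots.

```python
def interconnection_accepts(pending, slots, emitting_unit, target_unit):
    """pending: list of pending-transaction counts, one per MM unit.

    One unit emits a result, freeing one of its own slots; the interconnection
    as a whole must then accept a command, but only the unit the command is
    addressed to can take it.
    """
    counts = list(pending)
    counts[emitting_unit] -= 1
    return counts[target_unit] < slots
```

With two units each holding two pending transactions out of two slots, a result from unit 0 frees a slot only in unit 0, so a command addressed to unit 1 is refused.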

In the case of the cache, maintaining a constant number of pending transactions
in the cache and memory combined requires maintaining a constant number of pending
transactions in the memory alone. For every result packet transmitted by main memory,
another command must go from the cache to main memory. However, such commands only
occur when there are cache misses. If the cache runs into unusually good luck and gets a
continuous string of cache hits, it would not send commands to memory. In order to maintain
constant latency, it would have to refuse any result packets from memory. This could result
in some transactions remaining pending indefinitely. While this probably won't cause a data
flow computer to malfunction, it might be an undesirable effect in general.

These difficulties can probably be overcome through the addition of extra
circuitry to be sure that there is always space to handle all packets. It is not clear whether
the benefits of a memoryless structure controller and a cache without state "P" justify such
measures.
7.0 SUGGESTIONS FOR FURTHER RESEARCH

One of the principal problems remaining in the area of the design of systems
using the packet communication principle is the development of a practical and systematic
procedure for constructing modules that can be proven to meet given functional specifications.
An important tool for this task is the development of a rigorous and concise Architecture
Description Language (ADL). With the help of the ADL, the task can be divided into two parts:

(1) Development of a proof methodology so that systems expressed in the ADL
can be proven to meet functional specifications.

(2) Development of a system construction methodology so that systems
expressed in the ADL can be constructed with confidence that the physical
device will realize the ADL expression.

For this purpose, the ADL must be simple enough to correspond neatly to the
hardware devices involved, but powerful enough to make proofs involving history arrays
tractable.

Another remaining problem is, of course, to develop functional specifications for
all parts of the data flow computer system, including the structure controller, and give proofs
of their correctness. The functional specification of the computer itself (that is, the structure
controller's user) is needed, among other things, to show that no reference count violations
will occur.

An  efficient  structure  controller  needs  to  be  designed,  with  special  attention  to 
the  needs  of  programs  that  are  likely  to  arise. 

The  deadlock  problem  needs  to  be  examined  carefully,  to  see  if  it  is  worthwhile 
to  build  a memoryless  structure  controller. 
Proof that the concatenation of two FIFO buffers is a FIFO buffer, and lengths are additive.

This proof is given not because the statement is of fundamental interest, but as an example of
a method of proving statements about the behavior of systems, showing acknowledgments in
H.


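Let a FIFO of size $M$ have input port $X$ and output port $Z$,
let another FIFO of size $N$ have input port $Z$ and output port $Y$,
and let the ports $Z$ and the acknowledge ports $Z_A$ be linked.

From the definition of the first FIFO,

$$
\begin{aligned}
&(1) && |Z| = \min\{|X|,\ |Z_A| + 1\} \\
&(2) && Z_i = X_i \\
&(3) && |X_A| = \min\{|X|,\ |Z_A| + M\}
\end{aligned}
$$

From the definition of the second FIFO,

$$
\begin{aligned}
&(4) && |Y| = \min\{|Z|,\ |Y_A| + 1\} \\
&(5) && Y_i = Z_i \\
&(6) && |Z_A| = \min\{|Z|,\ |Y_A| + N\}
\end{aligned}
$$

Case I: Suppose $|X| \le |Y_A| + N$.

By the strong form of the Standard Acknowledge Restriction,
either $|Z| = |Z_A|$ or $|Z| = |Z_A| + 1$. If $|Z| = |Z_A| + 1$, then

$$
\begin{aligned}
|Z_A| &= |Y_A| + N && \text{(from 6, since } |Z_A| \ne |Z|\text{)} \\
|Z| &\le |X| && \text{(from 1)} \\
|Z_A| &< |X| \\
|Y_A| + N &< |X|
\end{aligned}
$$

which is a contradiction, so we must have $|Z| = |Z_A|$. Then

$$
\begin{aligned}
|Z| &= |X| && \text{(from 1, since } |Z| \ne |Z_A| + 1\text{)} \\
|Y| &= \min\{|X|,\ |Y_A| + 1\} && \text{(from 4)} \\
|X_A| &= \min\{|X|,\ |X| + M\} && \text{(from 3)} \\
\therefore\ |X_A| &= |X| && \text{(since } M \ge 0\text{)} \\
|X| &\le |Y_A| + M + N && \text{(by hypothesis and the fact that } M \ge 0\text{)} \\
\therefore\ |X_A| &= \min\{|X|,\ |Y_A| + M + N\}
\end{aligned}
$$

Case II: Suppose $|X| > |Y_A| + N$.

If $|Z| = |Z_A|$, then

$$
\begin{aligned}
|Z| &= |X| && \text{(from 1, since } |Z| \ne |Z_A| + 1\text{)} \\
|Z| &\le |Y_A| + N && \text{(from 6)} \\
|X| &\le |Y_A| + N
\end{aligned}
$$

which is a contradiction, so we must have $|Z| = |Z_A| + 1$. Then

$$
\begin{aligned}
|Z_A| &= |Y_A| + N && \text{(from 6, since } |Z_A| \ne |Z|\text{)} \\
|Z| &= |Y_A| + N + 1 \\
\therefore\ |Y_A| + 1 &\le |Z| && \text{(since } N \ge 0\text{)} \\
\therefore\ |Y| &= |Y_A| + 1 && \text{(from 4)} \\
|Z| &\le |X| && \text{(from 1)} \\
\therefore\ |Y_A| + 1 &\le |X| \\
\therefore\ |Y| &= \min\{|X|,\ |Y_A| + 1\} \\
|X_A| &= \min\{|X|,\ |Y_A| + M + N\} && \text{(from 3 and } |Z_A| = |Y_A| + N\text{)}
\end{aligned}
$$

In either case,

$$
\begin{aligned}
|Y| &= \min\{|X|,\ |Y_A| + 1\} \\
|X_A| &= \min\{|X|,\ |Y_A| + M + N\}
\end{aligned}
$$

which, together with $Y_i = X_i$ (from 2 and 5), are the defining equations of a FIFO of size $M + N$ with input port $X$ and output port $Y$.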
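The algebra of the proof can be spot-checked by machine. This sketch is not a proof: it represents each port only by its count, solves the linked equations (1) and (6) for |Z| and |Z_A| by iterating upward from zero, and then confirms the two conclusions for all small values of |X|, |Y_A|, M, and N. The function names and the ranges checked are arbitrary choices.

```python
def link(x, ya, n):
    """Solve equations (1) and (6) for the linked counts |Z| and |Z_A|,
    given |X| = x, |Y_A| = ya and second-FIFO size n."""
    z = za = 0
    while True:
        new_z = min(x, za + 1)        # equation (1)
        new_za = min(new_z, ya + n)   # equation (6)
        if (new_z, new_za) == (z, za):
            return z, za
        z, za = new_z, new_za


def check(limit=8, max_size=4):
    for m in range(max_size + 1):
        for n in range(max_size + 1):
            for x in range(limit + 1):
                for ya in range(limit + 1):
                    z, za = link(x, ya, n)
                    assert z - za in (0, 1)            # strong acknowledge form
                    y = min(z, ya + 1)                 # equation (4)
                    xa = min(x, za + m)                # equation (3)
                    assert y == min(x, ya + 1)
                    assert xa == min(x, ya + m + n)
    return True
```

The iteration is nondecreasing and bounded by |X|, so it always stops, and the point it stops at satisfies both (1) and (6).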
APPENDIX II

Algorithm for the cache.

Actual lookup in the cache is not shown. Instead, the special functions
cache-state(addr), cache-data(addr), cache-ref(addr), and cache-mod(addr) are used. These
are treated as though they were arrays, and are assumed to be defined whenever the given
address exists in the cache. In-cache(addr) returns true if the given address exists in the
cache.

Can-create(addr), where addr does not exist in the cache, tells whether it can
be created, that is, whether some cell in its column is unused or is in state T.

If can-create(addr) is true, creation-cell-is-empty(addr) tells whether the
former case holds, and, if so, cache-create(addr) performs the insertion into an unused cell.
Otherwise, cell-to-displace(addr) returns the address of a cell in state T, selecting the least
recently used item. Cache-rename(old, new) performs the replacement.

Processes start at Q, A.

input ports CMDI, MEMI

output ports RESO, MEMO

var cmd, item, addr, data, ref, old-addr, p

var m init false | tells whether to wait for input from MEMI

var memoflag init true | true when last packet sent at MEMO has been acknowledged

var memowait init false | true when need to send something on MEMO

var wait-pkt | the thing to send

var create-flag init false | true when need to create a new cache cell
var create-pkt | command that led to creation
var new-addr | address field of create-pkt
Q:

wait for acknowledge on port MEMO;
take the acknowledge;
memoflag := true;
goto Q

A:

until memoflag or packet is available on port MEMI do;
m := false | becomes true if should take packet at MEMI
if memoflag then | is memory ready for command?

if some-cell-in-state-Q-or-Q' then | see if need to send a CUK
addr := address-of-a-cell-in-state-Q-or-Q';
memoflag := false;
send CUK(addr) on port MEMO;
if cache-state(addr) = "Q" then | change Q to R, Q' to R'
cache-state(addr) := "R"

else

cache-state(addr) := "R'"

      else if memowait then                       | see if need to send FET* after creating a cell
         memowait := false;
         memoflag := false;
         send wait-pkt on port MEMO

      else if create-flag then                    | see if trying to create a cell
         if can-create(new-addr) then             | is some cell in its column empty or in state T?
            create-flag := false;                 | yes, will create the cell
            if creation-cell-is-empty(new-addr) then
               cache-create(new-addr)             | ok, cell empty, just put in new address
            else
               old-addr := cell-to-displace(new-addr);   | find cell to displace
               if cache-mod(old-addr) then
                  memoflag := false;              | write out previous contents if necessary
                  send UPD(old-addr, cache-data(old-addr), cache-ref(old-addr)) on port MEMO;
               cache-rename(old-addr, new-addr)   | create the new cell
            | the new cache cell now exists


            if create-pkt = UPD(-,-,-) then       | what command caused the creation?
               let create-pkt = UPD(-, data, ref);   | UPD, fill in new cell appropriately
               cache-mod(new-addr) := true;
               cache-data(new-addr) := data;
               cache-ref(new-addr) := ref;
               cache-state(new-addr) := "T"
            else                                  | command was FET*
               cache-mod(new-addr) := false;
               cache-state(new-addr) := "P";
               wait-pkt := create-pkt;            | queue command for transmission to memory
               memowait := true
         else
            m := true                             | can't create new cache cell, must wait

      else
         wait for packet on MEMI or CMDI, let p = that port;
         if p = 'CMDI' then
            | ***** process packet from CMDI *****
            cmd := RCVPKT(CMDI);
            if cmd = FET*(-,-) then
               let cmd = FET*(addr, tag);
               if in-cache(addr) then
                  if cache-state(addr) = "P" then
                     memoflag := false;           | state P, just send it onward
                     send cmd on port MEMO
                  else                            | state is R or T
                     if cmd = FET+(-,-) then      | need to update reference count?
                        cache-ref(addr) := cache-ref(addr) + 1;
                        cache-mod(addr) := true;
                        XMTPKT(RESO) := LOAD+(addr, cache-data(addr), cache-ref(addr), tag)
                     else if cmd = FET-(-,-) then
                        cache-ref(addr) := cache-ref(addr) - 1;
                        cache-mod(addr) := true;
                        XMTPKT(RESO) := LOAD-(addr, cache-data(addr), cache-ref(addr), tag)
                     else
                        XMTPKT(RESO) := LOAD(addr, cache-data(addr), cache-ref(addr), tag)
               else                               | state N
                  new-addr := addr;               | set flags so cell will be created
                  create-pkt := cmd;
                  create-flag := true
            else if cmd = UPD(-,-,-) then
               let cmd = UPD(addr, data, ref);
               if in-cache(addr) then             | must be state R or T
                  cache-data(addr) := data;
                  cache-ref(addr) := ref;
                  cache-mod(addr) := true
               else                               | state N
                  new-addr := addr;               | set flags so cell will be created
                  create-pkt := cmd;
                  create-flag := true
            else                                  | must be CLR
               let cmd = CLR(addr);
               if in-cache(addr) then             | state P, R or T
                  if cache-state(addr) = "P" then
                     cache-state(addr) := "P'"
                  else if cache-state(addr) = "R" then
                     cache-state(addr) := "R'"
                  else                            | state T
                     XMTPKT(RESO) := DONE(addr)
               else                               | state N
                  XMTPKT(RESO) := DONE(addr)
            | ***** end of CMDI processing *****

         else
            m := true                             | packet was from MEMI
   else
      m := true                                   | memoflag was off, must handle MEMI input
   if m then
      | ***** process packet from MEMI *****
      item := RCVPKT(MEMI);

      if item = LOAD*(-,-,-,-) then
         let item = LOAD*(addr, data, ref, tag);
         if cache-state(addr) = "P" then          | know it is in cache
            cache-data(addr) := data;
            cache-ref(addr) := ref;
            XMTPKT(RESO) := item;
            if memoflag then                      | can send packet at MEMO?
               memoflag := false;                 | yes
               send CLR(addr) on port MEMO;
               cache-state(addr) := "R"
            else
               cache-state(addr) := "Q"           | no
         else if cache-state(addr) = "P'" then
            cache-data(addr) := data;
            cache-ref(addr) := ref;
            XMTPKT(RESO) := item;
            if memoflag then                      | can send packet at MEMO?
               memoflag := false;                 | yes
               send CLR(addr) on port MEMO;
               cache-state(addr) := "R'"
            else
               cache-state(addr) := "Q'"          | no

         else                                     | must be state Q, Q', R, or R'
            if item = LOAD+(-,-,-,-) then         | update ref and send LOAD
               cache-ref(addr) := cache-ref(addr) + 1;
               cache-mod(addr) := true;
               XMTPKT(RESO) := LOAD+(addr, data, cache-ref(addr), tag)
            else if item = LOAD-(-,-,-,-) then
               cache-ref(addr) := cache-ref(addr) - 1;
               cache-mod(addr) := true;
               XMTPKT(RESO) := LOAD-(addr, data, cache-ref(addr), tag)
            else
               XMTPKT(RESO) := LOAD(addr, data, cache-ref(addr), tag)
      else                                        | must be DONE
         let item = DONE(addr);
         if cache-state(addr) = "R" then          | know it is in cache
            cache-state(addr) := "T"
         else                                     | must be state R'
            cache-state(addr) := "T";
            XMTPKT(RESO) := DONE(addr)
      | ***** end of MEMI processing *****
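The state changes that the listing above applies to a single cache cell can be collected into a few Python functions. This is a hedged sketch that tracks only the state letters and the kinds of packets sent; data, reference counts and tags are left out. The function names are hypothetical, and the argument memo_ready stands for memoflag.

```python
from typing import List, Tuple


def on_clr_command(state: str) -> Tuple[str, List[str]]:
    """CLR arriving at CMDI; returns (new state, packets sent on RESO)."""
    if state == "P":
        return "P'", []       # LOAD still expected from memory: remember the CLR
    if state == "R":
        return "R'", []       # DONE still expected from memory: remember the CLR
    return state, ["DONE"]    # state T, or address not cached (N): answer at once


def on_load(state: str, memo_ready: bool) -> Tuple[str, List[str], List[str]]:
    """LOAD arriving at MEMI; returns (new state, RESO packets, MEMO packets)."""
    if state in ("P", "P'"):
        primed = state.endswith("'")
        if memo_ready:        # memoflag true: send CLR to memory now
            return ("R'" if primed else "R"), ["LOAD"], ["CLR"]
        return ("Q'" if primed else "Q"), ["LOAD"], []
    return state, ["LOAD"], []  # Q, Q', R or R': pass the LOAD on


def send_deferred_clr(state: str) -> Tuple[str, List[str]]:
    """Top of loop with memoflag true: Q -> R, Q' -> R', CLR sent on MEMO."""
    return {"Q": "R", "Q'": "R'"}[state], ["CLR"]


def on_done(state: str) -> Tuple[str, List[str]]:
    """DONE arriving at MEMI; returns (new state, packets sent on RESO)."""
    if state == "R":
        return "T", []
    if state == "R'":
        return "T", ["DONE"]  # answer the CLR that arrived earlier at CMDI
    raise ValueError(f"DONE not expected in state {state}")
```

For example, a CLR arriving while a fetch is outstanding moves P to P'; if memory is busy when the LOAD returns, the cell waits in Q', is promoted to R' when the CLR can be sent, and finally reaches T, at which point the deferred DONE is released.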
APPENDIX III A

The insertion algorithm for the rotating memory.

flag := false                          | becomes true if TL already has UPD packet for this address
P := ⟨a(X)⟩                            | scan pointer := hash address initially
if RP ≠ TLP and P = RP and             | hash addr = start of removal region?
      (⟨a(X)⟩ ≠ ⟨a(TL(TLP))⟩ or a(X) < a(TL(TLP))) then
   TL(P) := X;                         | insert item at P
   RP := RP + 1 mod M;                 | shorten the removal region
   pop := pop + 1                      | update TL population

else
   if RP ≠ TLP and P ∈ [RP, TLP) then  | hash address in removal region
      P := TLP;                        | advance to end of removal region
   | repeat until find empty cell or enter removal region
   until (P = RP and RP ≠ TLP) or TL(P) = empty or flag = true do

   (
      | see if TL already has UPD with same CCD address
      if a(X) = a(TL(P)) and TL(P) = UPD(-,-,-) then
         flag := true
      else
         if (⟨a(TL(P))⟩ = ⟨a(X)⟩ and a(X) < a(TL(P)))
               or ⟨a(X)⟩ ∈ [⟨a(TL(P))⟩, P]   | is X "smaller" than the current item?
         then
            Y := TL(P);                | save item from TL
            TL(P) := X;                | insert X here
            X := Y;                    | insert saved item in next cell
                                       | (which pushes everything past here)
         P := P + 1 mod M              | advance to next cell
   );


   | find out whether to insert X or process it directly
   if not flag then                    | insert it
      if P = RP and RP ≠ TLP           | entered removal region?
      then
         TL(P) := X;                   | insert item at P
         RP := RP + 1 mod M            | shorten the removal region
      else
         TL(P) := X;                   | insert item at P
      pop := pop + 1                   | update TL population


   else                                | process it directly
      let TL(P) = UPD(addr, data, ref);
      if X = UPD(-,-,-) then
         TL(P) := X                    | another UPD, new one replaces old
      else if X = FET(-,-) then
         let X = FET(-, tag);          | FET, get the data
         XMTPKT(RESO) := LOAD(addr, data, ref, tag)
      else if X = FET+(-,-) then
         let X = FET+(-, tag);         | FET+, get the data and update ref
         TL(P) := UPD(addr, data, ref+1);
         XMTPKT(RESO) := LOAD+(addr, data, ref+1, tag)
      else if X = FET-(-,-) then
         let X = FET-(-, tag);         | FET-, get the data and update ref
         TL(P) := UPD(addr, data, ref-1);
         XMTPKT(RESO) := LOAD-(addr, data, ref-1, tag)
      else                             | must be CLR
         XMTPKT(RESO) := DONE(addr)
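The "process it directly" branch above merges a new command with a UPD already waiting in TL for the same address. A minimal Python sketch, assuming packets are encoded as tuples such as ("FET+", addr, tag) and ("UPD", addr, data, ref); both the encoding and the function name are hypothetical.

```python
from typing import Optional, Tuple

Packet = tuple  # hypothetical encoding: (kind, addr, ...) tuples


def process_directly(pending: Packet, x: Packet) -> Tuple[Packet, Optional[Packet]]:
    """Combine command x with the UPD queued in TL for the same address.

    Returns (entry left in TL, packet sent on RESO or None).
    """
    kind, addr, data, ref = pending
    if kind != "UPD":
        raise ValueError("pending entry must be a UPD")
    op = x[0]
    if op == "UPD":
        return x, None  # another UPD: the new one replaces the old
    if op == "FET":
        return pending, ("LOAD", addr, data, ref, x[2])
    if op == "FET+":
        return ("UPD", addr, data, ref + 1), ("LOAD+", addr, data, ref + 1, x[2])
    if op == "FET-":
        return ("UPD", addr, data, ref - 1), ("LOAD-", addr, data, ref - 1, x[2])
    if op == "CLR":
        return pending, ("DONE", addr)
    raise ValueError(f"unknown command {op}")
```

Because the queued UPD already holds the newest data and reference count, a fetch can be answered without waiting for the word to rotate past, and FET+ or FET- simply rewrites the queued UPD with the adjusted count.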
APPENDIX III B

The rotating memory algorithm.

process starts at A
input ports CMDI, CCDI
output ports RESO, CCDO

var P, X, Z, addr, data, ref, tag, CCD-addr, pop init 0, TL-cmd,
    CCD-data, CCD-ref, CCD-newref, TLP, RP

array TL size M

A: if TL(TLP) = empty then
      RP := TLP                           | destroy the removal region
   while TL(TLP) = empty and TLP ≠ ⟨CCD-addr⟩ do
   (
      TLP := TLP + 1 mod M;               | advance until catch up to CCD-addr
      RP := TLP                           | keep removal region destroyed
   );

   | look for input packets
   if pop ≥ M - 1
   then                                   | TL nearly full, can't take packets at CMDI
      Z := RCVPKT(CCDI);                  | wait for and accept packet at CCDI
      let Z = ADDR(CCD-addr, CCD-data, CCD-ref);
      CCD-newref := CCD-ref
   else                                   | can accept packet on either port
      wait for packet at CMDI or CCDI, let P = that port;   | nondeterminate!
      if P = 'CCDI' then
         Z := RCVPKT(CCDI);               | accept packet at CCDI
         let Z = ADDR(CCD-addr, CCD-data, CCD-ref);
         CCD-newref := CCD-ref
      else
         X := RCVPKT(CMDI);               | take packet at CMDI
         | ++++++++++++++++++++++++++++++++++
         | + insert or otherwise dispose of X
         | + (from Appendix III A)

   | perform all transactions matching CCD-addr
   while TL(TLP) ≠ empty and a(TL(TLP)) = CCD-addr do
   (
      TL-cmd := TL(TLP);                  | remove transaction from list
      TL(TLP) := empty;
      pop := pop - 1;                     | update TL population
      RP := ⟨a(TL-cmd)⟩;                  | shorten removal region appropriately
      TLP := TLP + 1 mod M;
      if TL-cmd = CLR(CCD-addr) then
         XMTPKT(RESO) := DONE(CCD-addr)


      else if TL-cmd = FET(-,-) then
         let TL-cmd = FET(addr, tag);
         XMTPKT(RESO) := LOAD(addr, CCD-data, CCD-newref, tag)
      else if TL-cmd = FET+(-,-) then
         let TL-cmd = FET+(addr, tag);
         CCD-newref := CCD-newref + 1;
         XMTPKT(RESO) := LOAD+(addr, CCD-data, CCD-newref, tag)
      else if TL-cmd = FET-(-,-) then
         let TL-cmd = FET-(addr, tag);
         CCD-newref := CCD-newref - 1;
         XMTPKT(RESO) := LOAD-(addr, CCD-data, CCD-newref, tag)
      else                                | must be UPD
         let TL-cmd = UPD(addr, data, ref);
         XMTPKT(CCDO) := WRITE(addr, data, ref)
   );


   | rewrite reference count if it has changed
   if CCD-ref ≠ CCD-newref then
      XMTPKT(CCDO) := WRITE(CCD-addr, CCD-data, CCD-newref)
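The transaction loop above, together with the final reference-count rewrite, can be sketched in Python for a single CCD word. As an assumption, the sketch receives the TL entries that match CCD-addr as a plain list in removal order rather than scanning the circular array TL, and it uses a hypothetical tuple encoding of packets; the function name is also hypothetical.

```python
from typing import List, Tuple


def serve_address(cmds: List[tuple], ccd_addr: int, ccd_data: object,
                  ccd_ref: int) -> Tuple[List[tuple], List[tuple]]:
    """Perform the queued transactions for ccd_addr as that word passes by.

    Returns (packets sent on RESO, packets sent on CCDO).
    """
    reso: List[tuple] = []
    ccdo: List[tuple] = []
    newref = ccd_ref  # running reference count, like CCD-newref
    for cmd in cmds:
        op = cmd[0]
        if op == "CLR":
            reso.append(("DONE", ccd_addr))
        elif op == "FET":
            reso.append(("LOAD", cmd[1], ccd_data, newref, cmd[2]))
        elif op == "FET+":
            newref += 1
            reso.append(("LOAD+", cmd[1], ccd_data, newref, cmd[2]))
        elif op == "FET-":
            newref -= 1
            reso.append(("LOAD-", cmd[1], ccd_data, newref, cmd[2]))
        elif op == "UPD":
            _, addr, data, ref = cmd
            ccdo.append(("WRITE", addr, data, ref))
        else:
            raise ValueError(f"unknown command {op}")
    # Rewrite the reference count only if some FET+ or FET- changed it.
    if newref != ccd_ref:
        ccdo.append(("WRITE", ccd_addr, ccd_data, newref))
    return reso, ccdo
```

Accumulating the count in a local variable means several FET+ and FET- commands for one word cost at most one extra WRITE per revolution.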